FUNCTIONAL LINEAR REGRESSION
We study in this paper a smoothness regularization method for
functional linear regression and provide a unified treatment for both
the prediction and estimation problems. By developing a tool on
simultaneous diagonalization of two positive definite kernels, we obtain
sharper results on the minimax rates of convergence and show that
smoothness regularized estimators achieve the optimal rates of
convergence for both prediction and estimation under conditions weaker
than those for the functional principal components based methods
developed in the literature. Despite the generality of the method of
regularization, we show that the procedure is easily implementable.
Numerical results are obtained to illustrate the merits of the method
and to demonstrate the theoretical developments.



1. Introduction. Consider the following functional linear regression
model, where the response $Y$ is related to a square integrable random
function $X(\cdot)$ through

(1)  $Y = \alpha_0 + \int_{\mathcal{T}} X(t)\beta_0(t)\,dt + \epsilon.$

Here $\alpha_0$ is the intercept, $\mathcal{T}$ is the domain of $X(\cdot)$, $\beta_0(\cdot)$ is an
unknown slope function and $\epsilon$ is a centered noise random variable. The
domain $\mathcal{T}$ is assumed to be a compact subset of a Euclidean space. Our
goal is to estimate $\alpha_0$ and $\beta_0(\cdot)$ as well as to retrieve

(2)  $\eta_0(X) := \alpha_0 + \int_{\mathcal{T}} X(t)\beta_0(t)\,dt$

based on a set of training data $(x_1, y_1), \ldots, (x_n, y_n)$ consisting of $n$
independent copies of $(X, Y)$. We shall assume that the slope function
$\beta_0$ resides in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, a subspace
of the collection of square integrable functions on $\mathcal{T}$.

In this paper, we investigate the method of regularization for estimating
$\eta_0$, as well as $\alpha_0$ and $\beta_0$. Let $\ell_n$ be a data fit functional that measures
how well $\eta$ fits the data and $J$ be a penalty functional that assesses the
"plausibility" of $\eta$. The method of regularization estimates $\eta_0$ by

(3)  $\hat\eta_{n\lambda} = \arg\min_{\eta} [\ell_n(\eta \mid \text{data}) + \lambda J(\eta)],$

where the minimization is taken over

(4)  $\{\eta : \mathcal{L}_2(\mathcal{T}) \to \mathbb{R} \mid \eta(X) = \alpha + \int_{\mathcal{T}} X\beta : \alpha \in \mathbb{R}, \beta \in \mathcal{H}\},$

and $\lambda \ge 0$ is a tuning parameter that balances the fidelity to the data
and the plausibility. Equivalently, the minimization can be taken over
$(\alpha, \beta)$ instead of $\eta$ to obtain estimates for both the intercept and
slope, denoted by $\hat\alpha_{n\lambda}$ and $\hat\beta_{n\lambda}$ hereafter. The most common choice of
the data fit functional is the squared error

(5)  $\ell_n(\eta) = \frac{1}{n} \sum_{i=1}^{n} [y_i - \eta(x_i)]^2.$

In general, $\ell_n$ is chosen such that it is convex in $\eta$ and $E\ell_n(\eta)$ is
uniquely minimized by $\eta_0$.
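As a minimal numerical sketch of (2) and (5), assume that each curve is observed on a uniform midpoint grid of $\mathcal{T} = [0, 1]$, so that the integral can be approximated by a grid average. The function names `eta` and `squared_error_loss` and the midpoint quadrature are illustrative assumptions, not part of the original development.

```python
import numpy as np

def eta(alpha, beta, x):
    """Approximate alpha + int_0^1 x(t) beta(t) dt.

    Assumption: x and beta are sampled on the same uniform midpoint
    grid of [0, 1], so the integral is the average of the product.
    """
    return alpha + np.mean(x * beta, axis=-1)

def squared_error_loss(alpha, beta, x, y):
    """Squared error data fit functional (5) for a candidate (alpha, beta)."""
    return np.mean((y - eta(alpha, beta, x)) ** 2)

if __name__ == "__main__":
    m = 200
    grid = (np.arange(m) + 0.5) / m          # midpoints of m equal cells
    x = np.vstack([np.ones(m), grid])        # two toy curves: x(t)=1, x(t)=t
    beta = grid.copy()                       # toy slope function beta(t)=t
    print(eta(1.0, beta, x))                 # close to [1.5, 1.3333]
```

Any other quadrature rule could be substituted; the midpoint rule is used only because it keeps the sketch short.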

In the context of functional linear regression, the penalty functional can
be conveniently defined through the slope function $\beta$ as a squared norm or
semi-norm associated with $\mathcal{H}$. The canonical example of $\mathcal{H}$ is the Sobolev
spaces. Without loss of generality, assume that $\mathcal{T} = [0, 1]$; the Sobolev
space of order $m$ is then defined as

  $\mathcal{W}_2^m([0,1]) = \{\beta : [0,1] \to \mathbb{R} \mid \beta, \beta^{(1)}, \ldots, \beta^{(m-1)}$ are absolutely
  continuous and $\beta^{(m)} \in \mathcal{L}_2\}.$

There are many possible norms with which $\mathcal{W}_2^m$ can be equipped to make it
a reproducing kernel Hilbert space. For example, it can be endowed with the
norm

(6)  $\|\beta\|_{\mathcal{W}_2^m}^2 = \sum_{q=0}^{m-1} \left(\int \beta^{(q)}\right)^2 + \int (\beta^{(m)})^2.$

The readers are referred to Adams (1975) for a thorough treatment of
this subject. In this case a possible choice of the penalty functional is
given by

(7)  $J(\beta) = \int_0^1 [\beta^{(m)}(t)]^2\,dt.$

Another setting of particular interest is $\mathcal{T} = [0,1]^2$, which naturally
occurs when $X$ represents an image. A popular choice in this setting is the
thin plate spline, where $J$ is given by

(8)  $J(\beta) = \int_0^1\!\!\int_0^1 \left[\left(\frac{\partial^2 \beta}{\partial x_1^2}\right)^2 + 2\left(\frac{\partial^2 \beta}{\partial x_1 \partial x_2}\right)^2 + \left(\frac{\partial^2 \beta}{\partial x_2^2}\right)^2\right] dx_1\,dx_2,$

and $(x_1, x_2)$ are the arguments of the bivariate function $\beta$. Other
examples of $\mathcal{T}$ include $\mathcal{T} = \{1, 2, \ldots, p\}$ for some positive integer $p$,
and the unit sphere in a Euclidean space, among others. The readers are
referred to Wahba (1990) for common choices of $\mathcal{H}$ and $J$ in these as well
as other contexts.

Other than the methods of regularization, a number of alternative
estimators have been introduced in recent years for the functional linear
regression [James (2002); Cardot, Ferraty and Sarda (2003); Ramsay and
Silverman (2005); Yao, Müller and Wang (2005); Ferraty and Vieu (2006);
Cai and Hall (2006); Li and Hsing (2007); Hall and Horowitz (2007);
Crambes, Kneip and Sarda (2009); Johannes (2009)]. Most of the existing
methods are based upon the functional principal component analysis (FPCA).
The success of these approaches hinges on the availability of a good
estimate of the functional principal components for $X(\cdot)$. In contrast, the
aforementioned smoothness regularized estimator avoids this task and
therefore circumvents assumptions on the spacing of the eigenvalues of the
covariance operator for $X(\cdot)$ as well as on the Fourier coefficients of $\beta_0$
with respect to the eigenfunctions, which are required by the FPCA-based
approaches. Furthermore, as we shall see in the subsequent theoretical
analysis, because the regularized estimator does not rely on estimating
the functional principal components, stronger results on the convergence
rates can be obtained.

Despite the generality of the method of regularization, we show that the
estimators can be computed rather efficiently. We first derive a representer
theorem in Section 2 which demonstrates that although the minimization
with respect to $\eta$ in (3) is taken over an infinite-dimensional space, the
solution can actually be found in a finite-dimensional subspace. This result
makes our procedure easily implementable and enables us to take advantage
of the existing techniques and algorithms for smoothing splines to compute
$\hat\eta_{n\lambda}$, $\hat\beta_{n\lambda}$ and $\hat\alpha_{n\lambda}$.

We then consider in Section 3 the relationship between the eigen
structures of the covariance operator for $X(\cdot)$ and the reproducing kernel
of the RKHS $\mathcal{H}$. These eigen structures play prominent roles in determining
the difficulty of the prediction and estimation problems in functional
linear regression. We prove in Section 3 a result on simultaneous
diagonalization of the reproducing kernel of the RKHS $\mathcal{H}$ and the covariance
operator of $X(\cdot)$ which provides a powerful machinery for studying the
minimax rates of convergence.

Section 4 investigates the rates of convergence of the smoothness
regularized estimators. Both the minimax upper and lower bounds are
established. The optimal convergence rates are derived in terms of a class
of intermediate norms which provide a wide range of measures for the
estimation accuracy. In particular, this approach gives a unified treatment
for both the prediction of $\eta_0(X)$ and the estimation of $\beta_0$. The results
show that the smoothness regularized estimators achieve the optimal rate of
convergence for both prediction and estimation under conditions weaker than
those for the functional principal components based methods developed in
the literature.

The representer theorem makes the regularized estimators easy to imple- 
ment. Several efficient algorithms are available in the literature that can be 
used for the numerical implementation of our procedure. Section 5 presents 
numerical studies to illustrate the merits of the method as well as demon- 
strate the theoretical developments. All proofs are relegated to Section 6. 

2. Representer theorem. The smoothness regularized estimators $\hat\eta_{n\lambda}$ and
$\hat\beta_{n\lambda}$ are defined as the solution to a minimization problem over an
infinite-dimensional space. Before studying the properties of the
estimators, we first show that the minimization is indeed well defined and
easily computable thanks to a version of the so-called representer theorem.

Let the penalty functional $J$ be a squared semi-norm on $\mathcal{H}$ such that the
null space

(9)  $\mathcal{H}_0 := \{\beta \in \mathcal{H} : J(\beta) = 0\}$

is a finite-dimensional linear subspace of $\mathcal{H}$ with orthonormal basis
$\{\xi_1, \ldots, \xi_N\}$ where $N := \dim(\mathcal{H}_0)$. Denote by $\mathcal{H}_1$ its orthogonal
complement in $\mathcal{H}$ such that $\mathcal{H} = \mathcal{H}_0 \oplus \mathcal{H}_1$. Similarly, for any function
$f \in \mathcal{H}$, there exists a unique decomposition $f = f_0 + f_1$ such that
$f_0 \in \mathcal{H}_0$ and $f_1 \in \mathcal{H}_1$. Note that $\mathcal{H}_1$ forms a reproducing kernel Hilbert
space with the inner product of $\mathcal{H}$ restricted to $\mathcal{H}_1$. Let $K(\cdot,\cdot)$ be the
corresponding reproducing kernel of $\mathcal{H}_1$ such that
$J(f_1) = \|f_1\|_K^2 = \|f_1\|_{\mathcal{H}}^2$ for any $f_1 \in \mathcal{H}_1$. Hereafter we use the
subscript $K$ to emphasize the correspondence between the inner product
and its reproducing kernel.

In what follows, we shall assume that $K$ is continuous and square
integrable. Note that $K$ is also a nonnegative definite operator on $\mathcal{L}_2$.
With slight abuse of notation, write

(10)  $(Kf)(\cdot) = \int_{\mathcal{T}} K(\cdot, s) f(s)\,ds.$

It is known [see, e.g., Cucker and Smale (2001)] that $Kf \in \mathcal{H}_1$ for any
$f \in \mathcal{L}_2$. Furthermore, for any $f \in \mathcal{H}_1$,

(11)  $\int_{\mathcal{T}} f(t)\beta(t)\,dt = \langle Kf, \beta \rangle_{\mathcal{H}}.$

This observation allows us to prove the following result which is
important to both numerical implementation of the procedure and our
theoretical analysis.

Theorem 1. Assume that $\ell_n$ depends on $\eta$ only through $\eta(x_1), \eta(x_2), \ldots,$
$\eta(x_n)$; then there exist $d = (d_1, \ldots, d_N)' \in \mathbb{R}^N$ and
$c = (c_1, \ldots, c_n)' \in \mathbb{R}^n$ such that

(12)  $\hat\beta_{n\lambda}(t) = \sum_{k=1}^{N} d_k \xi_k(t) + \sum_{i=1}^{n} c_i (K x_i)(t).$

Theorem 1 is a generalization of the well-known representer lemma for
smoothing splines (Wahba, 1990). It demonstrates that although the
minimization with respect to $\eta$ is taken over an infinite-dimensional
space, the solution can actually be found in a finite-dimensional subspace,
and it suffices to evaluate the coefficients $c$ and $d$ in (12). Its proof
follows a similar argument as that of Theorem 1.3.1 in Wahba (1990), where
$\ell_n$ is assumed to be squared error, and is therefore omitted here for
brevity.

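Consider, for example, the squared error loss. The regularized estimator
is given by

(13)  $(\hat\alpha_{n\lambda}, \hat\beta_{n\lambda}) = \arg\min_{\alpha \in \mathbb{R},\, \beta \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left[ y_i - \left( \alpha + \int_{\mathcal{T}} x_i(t)\beta(t)\,dt \right) \right]^2 + \lambda J(\beta) \right\}.$

It is not hard to see that

(14)  $\hat\alpha_{n\lambda} = \bar{y} - \int_{\mathcal{T}} \bar{x}(t) \hat\beta_{n\lambda}(t)\,dt,$

where $\bar{x}(t) = \frac{1}{n}\sum_{i=1}^{n} x_i(t)$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ are the sample averages of
$x$ and $y$, respectively. Consequently, (13) yields

(15)  $\hat\beta_{n\lambda} = \arg\min_{\beta \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left[ (y_i - \bar{y}) - \int_{\mathcal{T}} (x_i(t) - \bar{x}(t))\beta(t)\,dt \right]^2 + \lambda J(\beta) \right\}.$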

For the purpose of illustration, assume that $\mathcal{H} = \mathcal{W}_2^2$ and $J(\beta) = \int (\beta'')^2$.
Then $\mathcal{H}_0$ is the linear space spanned by $\xi_1(t) = 1$ and $\xi_2(t) = t$. A popular
reproducing kernel associated with $\mathcal{H}_1$ is

(16)  $K(s,t) = \frac{1}{(2!)^2} B_2(s) B_2(t) - \frac{1}{4!} B_4(|s - t|),$

where $B_m(\cdot)$ is the $m$th Bernoulli polynomial. The readers are referred to
Wahba (1990) for further details. Following Theorem 1, it suffices to
consider $\beta$ of the following form:

(17)  $\beta(t) = d_1 + d_2 t + \sum_{i=1}^{n} c_i \int_{\mathcal{T}} [x_i(s) - \bar{x}(s)] K(t,s)\,ds$

for some $d \in \mathbb{R}^2$ and $c \in \mathbb{R}^n$. Correspondingly,

  $\int_{\mathcal{T}} [X(t) - \bar{x}(t)]\beta(t)\,dt = d_1 \int_{\mathcal{T}} [X(t) - \bar{x}(t)]\,dt + d_2 \int_{\mathcal{T}} [X(t) - \bar{x}(t)]\,t\,dt$
  $\qquad + \sum_{i=1}^{n} c_i \int_{\mathcal{T}}\!\int_{\mathcal{T}} [x_i(s) - \bar{x}(s)] K(t,s) [X(t) - \bar{x}(t)]\,ds\,dt.$

Note also that for $\beta$ given in (17)

(18)  $J(\beta) = c'\Sigma c,$

where $\Sigma = (\Sigma_{ij})$ is an $n \times n$ matrix with

(19)  $\Sigma_{ij} = \int_{\mathcal{T}}\!\int_{\mathcal{T}} [x_i(s) - \bar{x}(s)] K(t,s) [x_j(t) - \bar{x}(t)]\,ds\,dt.$

Denote by $T = (T_{ij})$ an $n \times 2$ matrix whose $(i,j)$ entry is

(20)  $T_{ij} = \int_{\mathcal{T}} [x_i(t) - \bar{x}(t)]\,t^{j-1}\,dt$

for $j = 1, 2$. Set $y = (y_1, \ldots, y_n)'$. Then

(21)  $\ell_n(\eta) + \lambda J(\beta) = \frac{1}{n} \|y - (Td + \Sigma c)\|_{\ell_2}^2 + \lambda c'\Sigma c,$

which is quadratic in $c$ and $d$, and the explicit form of the solution can be
easily obtained for such a problem. This computational problem is similar to
that behind the smoothing splines. Write $W = \Sigma + n\lambda I$; then the minimizer
of (21) is given by

  $d = (T'W^{-1}T)^{-1} T'W^{-1} y,$

  $c = W^{-1}[I - T(T'W^{-1}T)^{-1}T'W^{-1}]y.$
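The computation in (16)-(21) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the curves are assumed to be sampled on a uniform midpoint grid of $[0,1]$, all integrals are replaced by midpoint sums, and the response is centered as in (15). The function names `bernoulli_kernel` and `fit_flr` are hypothetical.

```python
import numpy as np

def bernoulli_kernel(s, t):
    """Kernel (16): B_2(s) B_2(t) / (2!)^2 - B_4(|s - t|) / 4!."""
    b2 = lambda u: u**2 - u + 1.0 / 6.0
    b4 = lambda u: u**4 - 2.0 * u**3 + u**2 - 1.0 / 30.0
    return b2(s) * b2(t) / 4.0 - b4(np.abs(s - t)) / 24.0

def fit_flr(x, y, grid, lam):
    """Solve (21) for (d, c) and return (alpha_hat, beta_hat, c, d).

    Assumptions: x has shape (n, m), one curve per row, sampled on the
    uniform midpoint grid `grid` of [0, 1]; y is centered as in (15).
    """
    n, m = x.shape
    w = 1.0 / m                                   # midpoint quadrature weight
    xc = x - x.mean(axis=0)                       # x_i - x_bar
    yc = y - y.mean()                             # y_i - y_bar
    K = bernoulli_kernel(grid[:, None], grid[None, :])
    Sigma = w * w * xc @ K @ xc.T                 # (19)
    T = w * np.column_stack([xc.sum(axis=1), xc @ grid])   # (20)
    W = Sigma + n * lam * np.eye(n)
    Winv_T = np.linalg.solve(W, T)
    Winv_y = np.linalg.solve(W, yc)
    d = np.linalg.solve(T.T @ Winv_T, T.T @ Winv_y)
    c = np.linalg.solve(W, yc - T @ d)            # same c as the closed form
    beta = d[0] + d[1] * grid + w * (K @ xc.T) @ c          # (17) on the grid
    alpha = y.mean() - w * x.mean(axis=0) @ beta            # (14)
    return alpha, beta, c, d
```

Solving linear systems with `np.linalg.solve` instead of forming $W^{-1}$ explicitly is a numerical convenience; the resulting $d$ and $c$ agree with the closed-form expressions above.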

3. Simultaneous diagonalization. Before studying the asymptotic
properties of the regularized estimators $\hat\eta_{n\lambda}$ and $\hat\beta_{n\lambda}$, we first investigate
the relationship between the eigen structures of the covariance operator
for $X(\cdot)$ and the reproducing kernel of the functional space $\mathcal{H}$. As observed
in earlier studies [e.g., Cai and Hall (2006); Hall and Horowitz (2007)],
eigen structures play prominent roles in determining the nature of the
estimation problem in functional linear regression.

Recall that $K$ is the reproducing kernel of $\mathcal{H}_1$. Because $K$ is continuous
and square integrable, it follows from Mercer's theorem [Riesz and Sz-Nagy
(1955)] that $K$ admits the following spectral decomposition:

(22)  $K(s,t) = \sum_{k=1}^{\infty} \rho_k \psi_k(s) \psi_k(t).$



RKHS APPROACH TO FLR 






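Here $\rho_1 \ge \rho_2 \ge \cdots$ are the eigenvalues of $K$, and $\{\psi_1, \psi_2, \ldots\}$ are the
corresponding eigenfunctions, that is,

(23)  $K\psi_k = \rho_k \psi_k, \qquad k = 1, 2, \ldots.$

Moreover,

(24)  $\langle \psi_i, \psi_j \rangle_{\mathcal{L}_2} = \delta_{ij}$  and  $\langle \psi_i, \psi_j \rangle_K = \delta_{ij}/\rho_j,$

where $\delta_{ij}$ is the Kronecker delta.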

Consider, for example, the univariate Sobolev space $\mathcal{W}_2^m([0,1])$ with norm
(6) and penalty (7). Observe that

(25)  $\mathcal{H}_1 = \left\{ f \in \mathcal{H} : \int f^{(k)} = 0,\ k = 0, 1, \ldots, m-1 \right\}.$

It is known that [see, e.g., Wahba (1990)]

(26)  $K(s,t) = \frac{1}{(m!)^2} B_m(s) B_m(t) + \frac{(-1)^{m-1}}{(2m)!} B_{2m}(|s - t|).$

Recall that $B_m$ is the $m$th Bernoulli polynomial. It is known [see, e.g.,
Micchelli and Wahba (1981)] that in this case, $\rho_k \asymp k^{-2m}$, where for two
positive sequences $a_k$ and $b_k$, $a_k \asymp b_k$ means that $a_k/b_k$ is bounded away
from 0 and $\infty$ as $k \to \infty$.

Denote by C the covariance operator for X, that is, 

(27) C(s, t) = E{[X(s) - E(X(s))][X(t) - E(X(t))]}. 

There is a duality between reproducing kernel Hilbert spaces and covariance
operators [Stein (1999)]. Similarly to the reproducing kernel $K$, assuming
that the covariance operator $C$ is continuous and square integrable, we also
have the following spectral decomposition

(28)  $C(s,t) = \sum_{k=1}^{\infty} \mu_k \phi_k(s) \phi_k(t),$

where $\mu_1 \ge \mu_2 \ge \cdots$ are the eigenvalues and $\{\phi_1, \phi_2, \ldots\}$ are the
eigenfunctions such that

(29)  $C\phi_k := \int_{\mathcal{T}} C(\cdot, t)\phi_k(t)\,dt = \mu_k \phi_k, \qquad k = 1, 2, \ldots.$

The decay rate of the eigenvalues $\{\mu_k : k \ge 1\}$ can be determined by the
smoothness of the covariance operator $C$. More specifically, when $C$
satisfies the so-called Sacks-Ylvisaker conditions of order $s$, where $s$ is a
nonnegative integer [Sacks and Ylvisaker (1966, 1968, 1970)], then
$\mu_k \asymp k^{-2(s+1)}$. The readers are referred to the original papers by Sacks and
Ylvisaker or a more recent paper by Ritter, Wasilkowski and Wozniakowski
(1995) for detailed discussions of the Sacks-Ylvisaker conditions. The
conditions are also stated in the Appendix for completeness. Roughly
speaking, a covariance operator $C$ is said to satisfy the Sacks-Ylvisaker
conditions of order 0 if it is twice differentiable when $s \ne t$ but not
differentiable when $s = t$. A covariance operator $C$ satisfies the
Sacks-Ylvisaker conditions of order $r$ for an integer $r > 0$ if
$\partial^{2r} C(s,t)/(\partial s^r \partial t^r)$ satisfies the Sacks-Ylvisaker conditions of order 0.
In this paper, we say a covariance operator $C$ satisfies the
Sacks-Ylvisaker conditions if $C$ satisfies the Sacks-Ylvisaker conditions of
order $r$ for some $r \ge 0$. Various examples of covariance functions are known
to satisfy Sacks-Ylvisaker conditions. For example, the Ornstein-Uhlenbeck
covariance function $C(s,t) = \exp(-|s - t|)$ satisfies the Sacks-Ylvisaker
conditions of order 0. Ritter, Wasilkowski and Wozniakowski (1995) showed
that covariance functions satisfying the Sacks-Ylvisaker conditions are
also intimately related to Sobolev spaces, a fact that is useful for the
purpose of simultaneously diagonalizing $K$ and $C$, as we shall see later.

Note that the two sets of eigenfunctions $\{\psi_1, \psi_2, \ldots\}$ and $\{\phi_1, \phi_2, \ldots\}$
may differ from each other. The two kernels $K$ and $C$ can, however, be
simultaneously diagonalized. To avoid ambiguity, we shall assume in what
follows that $Cf \ne 0$ for any $f \in \mathcal{H}_0$ and $f \ne 0$. When using the squared error
loss, this is also a necessary condition to ensure that $E\ell_n(\eta)$ is uniquely
minimized even if $\beta$ is known to come from the finite-dimensional space
$\mathcal{H}_0$. Under this assumption, we can define a norm $\|\cdot\|_R$ in $\mathcal{H}$ by

(30)  $\|f\|_R^2 = \langle Cf, f \rangle_{\mathcal{L}_2} + J(f) = \int_{\mathcal{T} \times \mathcal{T}} f(s) C(s,t) f(t)\,ds\,dt + J(f).$

Note that $\|\cdot\|_R$ is a norm because $\|f\|_R^2$ defined above is a quadratic form
and is zero if and only if $f = 0$.

The following proposition shows that when this condition holds, $\|\cdot\|_R$ is
well defined on $\mathcal{H}$ and equivalent to its original norm, $\|\cdot\|_{\mathcal{H}}$, in that there
exist constants $0 < c_1 < c_2 < \infty$ such that $c_1\|f\|_R \le \|f\|_{\mathcal{H}} \le c_2\|f\|_R$ for
all $f \in \mathcal{H}$. In particular, $\|f\|_R < \infty$ if and only if $\|f\|_{\mathcal{H}} < \infty$.

Proposition 2. If $Cf \ne 0$ for any $f \in \mathcal{H}_0$ and $f \ne 0$, then $\|\cdot\|_R$ and
$\|\cdot\|_{\mathcal{H}}$ are equivalent.

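Let $R$ be the reproducing kernel associated with $\|\cdot\|_R$. Recall that $R$
can also be viewed as a positive operator. Denote by $\{(\rho_k', \psi_k') : k \ge 1\}$ the
eigenvalues and eigenfunctions of $R$. Then $R$ is a linear map from $\mathcal{L}_2$ to
$\mathcal{L}_2$ such that

(31)  $R\psi_k' = \int_{\mathcal{T}} R(\cdot, t)\psi_k'(t)\,dt = \rho_k' \psi_k', \qquad k = 1, 2, \ldots.$

The square root of the positive definite operator can therefore be given as
the linear map from $\mathcal{L}_2$ to $\mathcal{L}_2$ such that

(32)  $R^{1/2}\psi_k' = (\rho_k')^{1/2} \psi_k', \qquad k = 1, 2, \ldots.$

Let $\nu_1 \ge \nu_2 \ge \cdots$ be the eigenvalues of the bounded linear operator
$R^{1/2} C R^{1/2}$ and $\{\zeta_k : k = 1, 2, \ldots\}$ be the corresponding orthogonal
eigenfunctions in $\mathcal{L}_2$. Write $\omega_k = \nu_k^{-1/2} R^{1/2} \zeta_k$, $k = 1, 2, \ldots$. Also let
$\langle \cdot, \cdot \rangle_R$ be the inner product associated with $\|\cdot\|_R$, that is, for any
$f, g \in \mathcal{H}$,

(33)  $\langle f, g \rangle_R = \frac{1}{4} \left( \|f + g\|_R^2 - \|f - g\|_R^2 \right).$

It is not hard to see that

(34)  $\langle \omega_j, \omega_k \rangle_R = \nu_j^{-1/2} \nu_k^{-1/2} \langle R^{1/2}\zeta_j, R^{1/2}\zeta_k \rangle_R = \nu_k^{-1} \langle \zeta_j, \zeta_k \rangle_{\mathcal{L}_2} = \nu_k^{-1} \delta_{jk},$

and

  $\langle C^{1/2}\omega_j, C^{1/2}\omega_k \rangle_{\mathcal{L}_2} = \nu_j^{-1/2} \nu_k^{-1/2} \langle R^{1/2} C R^{1/2} \zeta_j, \zeta_k \rangle_{\mathcal{L}_2} = \delta_{jk}.$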

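The following theorem shows that the quadratic forms $\|f\|_R^2 = \langle f, f \rangle_R$ and
$\langle Cf, f \rangle_{\mathcal{L}_2}$ can be simultaneously diagonalized on the basis of
$\{\omega_k : k \ge 1\}$.

Theorem 3. For any $f \in \mathcal{H}$,

(35)  $f = \sum_{k=1}^{\infty} f_k \omega_k,$

in the absolute sense where $f_k = \nu_k \langle f, \omega_k \rangle_R$. Furthermore, if
$\gamma_k = (\nu_k^{-1} - 1)^{-1}$, then

(36)  $\langle f, f \rangle_R = \sum_{k=1}^{\infty} (1 + \gamma_k^{-1}) f_k^2$  and  $\langle Cf, f \rangle_{\mathcal{L}_2} = \sum_{k=1}^{\infty} f_k^2.$

Consequently,

(37)  $J(f) = \langle f, f \rangle_R - \langle Cf, f \rangle_{\mathcal{L}_2} = \sum_{k=1}^{\infty} \gamma_k^{-1} f_k^2.$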
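A finite-dimensional analogue may help to fix ideas about Theorem 3. In the sketch below the operators are replaced by symmetric positive definite matrices: `C` plays the role of the covariance operator, `P` the role of the penalty $J$ (taken positive definite here so that every $\gamma_k$ is finite), and $A = C + P$ the role of the $R$-norm, so that $R = A^{-1}$. These matrix stand-ins and the name `simultaneous_diagonalize` are assumptions made purely for illustration.

```python
import numpy as np

def simultaneous_diagonalize(C, P):
    """Matrix analogue of Theorem 3.

    Returns (nu, gamma, Omega) where the columns omega_k of Omega satisfy
    omega_j' C omega_k = delta_jk, omega_j' (C + P) omega_k = delta_jk / nu_k
    and omega_j' P omega_k = delta_jk / gamma_k.
    """
    A = C + P                                  # Gram matrix of the R-norm
    a, U = np.linalg.eigh(A)
    R_half = (U / np.sqrt(a)) @ U.T            # A^{-1/2}, i.e. R^{1/2}
    nu, Z = np.linalg.eigh(R_half @ C @ R_half)
    order = np.argsort(nu)[::-1]               # nu_1 >= nu_2 >= ...
    nu, Z = nu[order], Z[:, order]
    Omega = (R_half @ Z) / np.sqrt(nu)         # omega_k = nu_k^{-1/2} R^{1/2} zeta_k
    gamma = 1.0 / (1.0 / nu - 1.0)             # gamma_k = (nu_k^{-1} - 1)^{-1}
    return nu, gamma, Omega
```

In this analogue both quadratic forms become diagonal in the basis given by the columns of `Omega`, which mirrors (36) and (37).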

Note that $\{(\gamma_k, \omega_k) : k \ge 1\}$ can be determined jointly by
$\{(\rho_k, \psi_k) : k \ge 1\}$ and $\{(\mu_k, \phi_k) : k \ge 1\}$. However, in general, neither
$\gamma_k$ nor $\omega_k$ can be given in explicit form of $\{(\rho_k, \psi_k) : k \ge 1\}$ and
$\{(\mu_k, \phi_k) : k \ge 1\}$. One notable exception is the case when the operators
$C$ and $K$ commute. In particular, the setting $\psi_k = \phi_k$, $k = 1, 2, \ldots$, is
commonly adopted when studying FPCA-based approaches [see, e.g., Cai and
Hall (2006); Hall and Horowitz (2007)].

Proposition 4. Assume that $\psi_k = \phi_k$, $k = 1, 2, \ldots$; then $\gamma_k = \rho_k \mu_k$ and
$\omega_k = \mu_k^{-1/2} \psi_k$.
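Proposition 4 can be checked heuristically from (24), (28) and (37). The short calculation below is a sketch that treats each $\psi_k$ as an element of $\mathcal{H}_1$, so that $J$ coincides with the squared $K$-norm on the span of the $\psi_k$; this simplification is an assumption of the sketch rather than a statement from the proof.

```latex
\begin{align*}
\langle C\omega_j, \omega_k \rangle_{\mathcal{L}_2}
  &= \mu_j^{-1/2}\mu_k^{-1/2} \langle C\psi_j, \psi_k \rangle_{\mathcal{L}_2}
   = \mu_j^{-1/2}\mu_k^{-1/2}\,\mu_j\,\delta_{jk}
   = \delta_{jk}, \\
\langle \omega_j, \omega_k \rangle_{K}
  &= \mu_j^{-1/2}\mu_k^{-1/2} \langle \psi_j, \psi_k \rangle_{K}
   = \mu_j^{-1/2}\mu_k^{-1/2}\,\rho_k^{-1}\,\delta_{jk}
   = (\rho_k \mu_k)^{-1}\,\delta_{jk}.
\end{align*}
% Comparing the second line with J(f) = \sum_k \gamma_k^{-1} f_k^2 in (37)
% gives \gamma_k^{-1} = (\rho_k \mu_k)^{-1}, that is, \gamma_k = \rho_k \mu_k.
```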
In general, when $\psi_k$ and $\phi_k$ differ, such a relationship no longer holds.
The following theorem reveals that similar asymptotic behavior of $\gamma_k$ can
still be expected in many practical settings.

Theorem 5. Consider the one-dimensional case when $\mathcal{T} = [0,1]$. If $\mathcal{H}$
is the Sobolev space $\mathcal{W}_2^m([0,1])$ endowed with norm (6), and $C$ satisfies the
Sacks-Ylvisaker conditions, then $\gamma_k \asymp \rho_k \mu_k$.

Theorem 5 shows that under fairly general conditions $\gamma_k \asymp \rho_k \mu_k$. In this
case, there is little difference between the general situation and the
special case when $K$ and $C$ share a common set of eigenfunctions when working
with the system $\{(\gamma_k, \omega_k), k = 1, 2, \ldots\}$. This observation is crucial for
our theoretical development in the next section.

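4. Convergence rates. We now turn to the asymptotic properties of the
smoothness regularized estimators. To fix ideas, in what follows, we shall
focus on the squared error loss. Recall that in this case

(38)  $(\hat\alpha_{n\lambda}, \hat\beta_{n\lambda}) = \arg\min_{\alpha \in \mathbb{R},\, \beta \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left[ y_i - \left( \alpha + \int_{\mathcal{T}} x_i(t)\beta(t)\,dt \right) \right]^2 + \lambda J(\beta) \right\}.$

As shown before, the slope function can be equivalently defined as

(39)  $\hat\beta_{n\lambda} = \arg\min_{\beta \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left[ (y_i - \bar{y}) - \int_{\mathcal{T}} (x_i(t) - \bar{x}(t))\beta(t)\,dt \right]^2 + \lambda J(\beta) \right\},$

and once $\hat\beta_{n\lambda}$ is computed, $\hat\alpha_{n\lambda}$ is given by

(40)  $\hat\alpha_{n\lambda} = \bar{y} - \int_{\mathcal{T}} \bar{x}(t)\hat\beta_{n\lambda}(t)\,dt.$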

In light of this fact, we shall focus our attention on $\hat\beta_{n\lambda}$ in the following
discussion for brevity. We shall also assume that the eigenvalues of the
reproducing kernel $K$ satisfy $\rho_k \asymp k^{-2r}$ for some $r > 1/2$. Let $\mathcal{F}(s, M, K)$
be the collection of the distributions $F$ of the process $X$ that satisfy the
following conditions:

(a) The eigenvalues $\mu_k$ of its covariance operator $C(\cdot,\cdot)$ satisfy
$\mu_k \asymp k^{-2s}$ for some $s > 1/2$.

(b) For any function $f \in \mathcal{L}_2(\mathcal{T})$,

(41)  $E\left( \int_{\mathcal{T}} f(t)[X(t) - E(X)(t)]\,dt \right)^4 \le M \left( E\left( \int_{\mathcal{T}} f(t)[X(t) - E(X)(t)]\,dt \right)^2 \right)^2.$

(c) When simultaneously diagonalizing $K$ and $C$, $\gamma_k \asymp \rho_k \mu_k$, where
$\nu_k = (1 + \gamma_k^{-1})^{-1}$ is the $k$th largest eigenvalue of $R^{1/2} C R^{1/2}$ and $R$ is
the reproducing kernel associated with $\|\cdot\|_R$ defined by (30).

The first condition specifies the smoothness of the sample path of $X(\cdot)$.
The second condition concerns the fourth moment of a linear functional of
$X(\cdot)$. This condition is satisfied with $M = 3$ for a Gaussian process because
$\int f(t)X(t)\,dt$ is normally distributed. In the light of Theorem 5, the last
condition is satisfied by any covariance function that satisfies the
Sacks-Ylvisaker conditions if $\mathcal{H}$ is taken to be $\mathcal{W}_2^m$ with norm (6). It is
also trivially satisfied if the eigenfunctions of the covariance operator
$C$ coincide with those of $K$.
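The constant $M = 3$ in the Gaussian case follows from the fourth moment of a centered normal variable. As a short worked step, write $Z_f$ for the centered linear functional in (41) and $\sigma_f^2$ for its variance; both symbols are introduced here only for this illustration.

```latex
\begin{align*}
Z_f &:= \int_{\mathcal{T}} f(t)\,[X(t) - E(X)(t)]\,dt \sim N(0, \sigma_f^2),
\qquad \sigma_f^2 = \langle Cf, f \rangle_{\mathcal{L}_2}, \\
E Z_f^4 &= 3\sigma_f^4 = 3\,\bigl(E Z_f^2\bigr)^2 ,
\end{align*}
% so (41) holds with equality for M = 3 whenever X is a Gaussian process.
```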

4.1. Optimal rates of convergence. We are now ready to state our main
results on the optimal rates of convergence, which are given in terms of a
class of intermediate norms between $\|f\|_K$ and

(42)  $\left( \int\!\!\int f(s) C(s,t) f(t)\,ds\,dt \right)^{1/2},$

which enables a unified treatment of both the prediction and estimation
problems. For $0 \le a \le 1$ define the norm $\|\cdot\|_a$ by

(43)  $\|f\|_a^2 = \sum_{k=1}^{\infty} (1 + \gamma_k^{-a}) f_k^2,$

where $f_k = \nu_k \langle f, \omega_k \rangle_R$ as shown in Theorem 3. Clearly $\|f\|_0^2$ reduces to
$\langle Cf, f \rangle_{\mathcal{L}_2}$ up to a constant factor, whereas $\|f\|_1 = \|f\|_R$. The convergence
rate results given below are valid for all $0 \le a \le 1$. They cover a range of
interesting cases including the prediction error and estimation error.

The following result gives the optimal rate of convergence for the
regularized estimator $\hat\beta_{n\lambda}$ with an appropriately chosen tuning parameter
$\lambda$ under the loss $\|\cdot\|_a$.

Theorem 6. Assume that $E(\epsilon_i) = 0$ and $\mathrm{Var}(\epsilon_i) \le M_2$. Suppose the
eigenvalues $\rho_k$ of the reproducing kernel $K$ of the RKHS $\mathcal{H}$ satisfy
$\rho_k \asymp k^{-2r}$ for some $r > 1/2$. Then the regularized estimator $\hat\beta_{n\lambda}$ with

(44)  $\lambda \asymp n^{-2(r+s)/(2(r+s)+1)}$

satisfies

(45)  $\lim_{D \to \infty} \limsup_{n \to \infty} \sup_{F \in \mathcal{F}(s,M,K),\, \beta_0 \in \mathcal{H}} P\left( \|\hat\beta_{n\lambda} - \beta_0\|_a^2 > D n^{-2(1-a)(r+s)/(2(r+s)+1)} \right) = 0.$

Note that the rate of the optimal choice of $\lambda$ does not depend on $a$.
Theorem 6 shows that the optimal rate of convergence for the regularized
estimator $\hat\beta_{n\lambda}$ is $n^{-2(1-a)(r+s)/(2(r+s)+1)}$. The following lower bound result
demonstrates that this rate of convergence is indeed optimal among all
estimators, and consequently the upper bound in equation (45) cannot be
improved. Denote by $\mathcal{B}$ the collection of all measurable functions of the
observations $(X_1, Y_1), \ldots, (X_n, Y_n)$.

Theorem 7. Under the assumptions of Theorem 6, there exists a
constant $d > 0$ such that

(46)  $\lim_{n \to \infty} \inf_{\tilde\beta \in \mathcal{B}} \sup_{F \in \mathcal{F}(s,M,K),\, \beta_0 \in \mathcal{H}} P\left( \|\tilde\beta - \beta_0\|_a^2 > d n^{-2(1-a)(r+s)/(2(r+s)+1)} \right) > 0.$

Consequently, the regularized estimator $\hat\beta_{n\lambda}$ with $\lambda \asymp n^{-2(r+s)/(2(r+s)+1)}$ is
rate optimal.

The results, given in terms of $\|\cdot\|_a$, provide a wide range of measures of
the quality of an estimate for $\beta_0$. Observe that

(47)  $\|\tilde\beta - \beta_0\|_0^2 = E_{X^*} \left( \int \tilde\beta(t) X^*(t)\,dt - \int \beta_0(t) X^*(t)\,dt \right)^2,$

where $X^*$ is an independent copy of $X$, and the expectation on the right-hand
side is taken over $X^*$. The right-hand side is often referred to as the
prediction error in regression. It measures the mean squared prediction
error for a random future observation on $X$. From Theorems 6 and 7, we have
the following corollary.

Corollary 8. Under the assumptions of Theorem 6, the mean squared
optimal prediction error of a slope function estimator over
$F \in \mathcal{F}(s, M, K)$ and $\beta_0 \in \mathcal{H}$ is of the order $n^{-2(r+s)/(2(r+s)+1)}$ and it can be
achieved by the regularized estimator $\hat\beta_{n\lambda}$ with $\lambda$ satisfying (44).

The result shows that the faster the eigenvalues of the covariance operator
$C$ for $X(\cdot)$ decay, the smaller the prediction error.

When $\psi_k = \phi_k$, the prediction error of a slope function estimator $\tilde\beta$ can
also be understood as the squared prediction error for a fixed predictor
$x^*(\cdot)$ such that $|\langle x^*, \phi_k \rangle_{\mathcal{L}_2}| \asymp k^{-s}$, following the discussion from the
last section. A similar prediction problem has also been considered by Cai
and Hall (2006) for FPCA-based approaches. In particular, they established
a similar minimax lower bound and showed that the lower bound can be
achieved by the FPCA-based approach, but with additional assumptions
that $\mu_k - \mu_{k+1} \ge C_0^{-1} k^{-2s-1}$, and $2r > 4s + 3$. Our results here indicate
that both restrictions are unnecessary for establishing the minimax rate
for the prediction error. Moreover, in contrast to the FPCA-based approach,
the regularized estimator $\hat\beta_{n\lambda}$ can achieve the optimal rate without the
extra requirements.
To illustrate the generality of our results, we consider an example where
$\mathcal{T} = [0, 1]$, $\mathcal{H} = \mathcal{W}_2^m([0, 1])$ and the stochastic process $X(\cdot)$ is a Wiener
process. It is not hard to see that the covariance operator of $X$,
$C(s,t) = \min\{s, t\}$, satisfies the Sacks-Ylvisaker conditions of order 0 and
therefore $\mu_k \asymp k^{-2}$. By Corollary 8, the minimax rate of the prediction
error in estimating $\beta_0$ is $n^{-(2m+2)/(2m+3)}$. Note that the condition
$2r > 4s + 3$ required by Cai and Hall (2006) does not hold here for $m \le 7/2$.
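The eigenvalue decay in the Wiener example can be checked numerically. The sketch below discretizes the covariance operator on a uniform midpoint grid (a Nyström-type approximation, which is an assumption of this illustration) and compares the leading eigenvalues with the known values $\mu_k = 1/((k - 1/2)^2\pi^2)$ for $C(s,t) = \min\{s,t\}$. The function name `covariance_eigenvalues` is hypothetical.

```python
import numpy as np

def covariance_eigenvalues(cov, m=400):
    """Approximate eigenvalues of the integral operator with kernel `cov`
    on [0, 1], using a uniform midpoint grid with m points.
    Returned in decreasing order."""
    grid = (np.arange(m) + 0.5) / m
    Cmat = cov(grid[:, None], grid[None, :]) / m   # kernel times quadrature weight
    return np.linalg.eigvalsh(Cmat)[::-1]

if __name__ == "__main__":
    mu = covariance_eigenvalues(np.minimum)        # Wiener covariance min(s, t)
    k = np.arange(1, 6)
    exact = 1.0 / ((k - 0.5) ** 2 * np.pi ** 2)
    print(np.column_stack([mu[:5], exact]))        # the two columns nearly agree
```

The same helper can be applied to other kernels, for instance `lambda s, t: np.exp(-np.abs(s - t))` for the Ornstein-Uhlenbeck covariance, to inspect the decay empirically.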

4.2. The special case of $\phi_k = \psi_k$. It is of interest to further look into
the case when the operators $C$ and $K$ share a common set of eigenfunctions.
As discussed in the last section, we have in this case $\phi_k = \psi_k$ and
$\gamma_k \asymp k^{-2(r+s)}$ for all $k \ge 1$. In this context, Theorems 6 and 7 provide
bounds for more general prediction problems. Consider estimating
$\int x^* \beta_0$ where $x^*$ satisfies $|\langle x^*, \phi_k \rangle_{\mathcal{L}_2}| \asymp k^{-s+q}$ for some
$0 < q < s - 1/2$. Note that $q < s - 1/2$ is needed to ensure that $x^*$ is square
integrable. The squared prediction error

(48)  $\left( \int \tilde\beta(t) x^*(t)\,dt - \int \beta_0(t) x^*(t)\,dt \right)^2$

is therefore equivalent to $\|\tilde\beta - \beta_0\|_{(s-q)/(r+s)}^2$. The following result is a
direct consequence of Theorems 6 and 7.

Corollary 9. Suppose $x^*$ is a function satisfying $|\langle x^*, \phi_k \rangle_{\mathcal{L}_2}| \asymp k^{-s+q}$
for some $0 < q < s - 1/2$. Then under the assumptions of Theorem 6,

(49)  $\lim_{n \to \infty} \inf_{\tilde\beta \in \mathcal{B}} \sup_{F \in \mathcal{F}(s,M,K),\, \beta_0 \in \mathcal{H}} P\left\{ \left( \int \tilde\beta(t) x^*(t)\,dt - \int \beta_0(t) x^*(t)\,dt \right)^2 > d n^{-2(r+q)/(2(r+s)+1)} \right\} > 0$

for some constant $d > 0$, and the regularized estimator $\hat\beta_{n\lambda}$ with $\lambda$
satisfying (44) achieves the optimal rate of convergence under the
prediction error (48).

It is also evident that when $\psi_k = \phi_k$, $\|\cdot\|_{s/(r+s)}$ is equivalent to $\|\cdot\|_{\mathcal{L}_2}$.
Therefore, Theorems 6 and 7 imply the following result.

Corollary 10. If $\phi_k = \psi_k$ for all $k \ge 1$, then under the assumptions
of Theorem 6

(50)  $\lim_{n \to \infty} \inf_{\tilde\beta \in \mathcal{B}} \sup_{F \in \mathcal{F}(s,M,K),\, \beta_0 \in \mathcal{H}} P\left( \|\tilde\beta - \beta_0\|_{\mathcal{L}_2}^2 > d n^{-2r/(2(r+s)+1)} \right) > 0$

for some constant $d > 0$, and the regularized estimate $\hat\beta_{n\lambda}$ with $\lambda$
satisfying (44) achieves the optimal rate.

This result demonstrates that the faster the eigenvalues of the covariance 
operator for X(-) decay, the larger the estimation error. The behavior of the 
estimation error thus differs significantly from that of prediction error. 
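The contrast between the two error measures can be made concrete by evaluating the exponent in Theorems 6 and 7. The helper below is a small sketch using exact rational arithmetic; the function name `rate_exponent` is hypothetical, and the choices $a = 0$ (prediction) and $a = s/(r+s)$ (estimation in $\mathcal{L}_2$ when $\phi_k = \psi_k$) follow Corollaries 8 and 10.

```python
from fractions import Fraction

def rate_exponent(r, s, a):
    """Exponent e such that the minimax rate under ||.||_a^2 is n^{-e},
    namely e = 2 (1 - a) (r + s) / (2 (r + s) + 1)."""
    r, s, a = Fraction(r), Fraction(s), Fraction(a)
    return 2 * (1 - a) * (r + s) / (2 * (r + s) + 1)

if __name__ == "__main__":
    r = 2
    for s in (1, 2, 3):
        pred = rate_exponent(r, s, 0)                      # prediction error
        est = rate_exponent(r, s, Fraction(s, r + s))      # L2 estimation error
        print(s, pred, est)
    # The prediction exponent grows with s, the estimation exponent shrinks.
```

For $r = 2$ and $s = 1$ this gives the exponents $6/7$ and $4/7$, in agreement with $(2m+2)/(2m+3)$ for $m = 2$ in the Wiener example and with $2r/(2(r+s)+1)$ in Corollary 10.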
Similar results on the lower bound have recently been obtained by Hall
and Horowitz (2007), who considered estimating $\beta_0$ under the assumption
that $|\langle \beta_0, \phi_k \rangle_{\mathcal{L}_2}|$ decays in a polynomial order. Note that this slightly
differs from our setting where $\beta_0 \in \mathcal{H}$ means that

(51)  $\sum_{k=1}^{\infty} \rho_k^{-1} \langle \beta_0, \psi_k \rangle_{\mathcal{L}_2}^2 = \sum_{k=1}^{\infty} \rho_k^{-1} \langle \beta_0, \phi_k \rangle_{\mathcal{L}_2}^2 < \infty.$

Recall that $\rho_k \asymp k^{-2r}$. Condition (51) is comparable to, and slightly
stronger than,

(52)  $|\langle \beta_0, \phi_k \rangle_{\mathcal{L}_2}| \le M_0 k^{-r-1/2}$

for some constant $M_0 > 0$. When further assuming that $2s + 1 < 2r$, and
$\mu_k - \mu_{k+1} \ge M_0^{-1} k^{-2s-1}$ for all $k \ge 1$, Hall and Horowitz (2007) obtain the
same lower bound as ours. However, we do not require that $2s + 1 < 2r$,
which in essence states that $\beta_0$ is smoother than the sample path of $X$.
Perhaps more importantly, we do not require the spacing condition
$\mu_k - \mu_{k+1} \ge M_0^{-1} k^{-2s-1}$ on the eigenvalues because we do not need to
estimate the corresponding eigenfunctions. Such a condition is impossible
to verify even for a standard RKHS.

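4.3. Estimating derivatives. Theorems 6 and 7 can also be used for
estimating the derivatives of $\beta_0$. A natural estimator of the $q$th derivative
of $\beta_0$, $\beta_0^{(q)}$, is $\hat\beta_{n\lambda}^{(q)}$, the $q$th derivative of $\hat\beta_{n\lambda}$. In addition to
$\phi_k = \psi_k$, assume that $\|\psi_k^{(q)}/\psi_k\|_\infty \asymp k^q$. This clearly holds when
$\mathcal{H} = \mathcal{W}_2^m$. In this case

(53)  $\|\tilde\beta^{(q)} - \beta_0^{(q)}\|_{\mathcal{L}_2} \le C_0 \|\tilde\beta - \beta_0\|_{(s+q)/(r+s)}.$

The following is then a direct consequence of Theorems 6 and 7.

Corollary 11. Assume that $\phi_k = \psi_k$ and $\|\psi_k^{(q)}/\psi_k\|_\infty \asymp k^q$ for all
$k \ge 1$. Then under the assumptions of Theorem 6, for some constant $d > 0$,

(54)  $\lim_{n \to \infty} \inf_{\tilde\beta^{(q)} \in \mathcal{B}} \sup_{F \in \mathcal{F}(s,M,K),\, \beta_0 \in \mathcal{H}} P\left( \|\tilde\beta^{(q)} - \beta_0^{(q)}\|_{\mathcal{L}_2}^2 > d n^{-2(r-q)/(2(r+s)+1)} \right) > 0,$

and the regularized estimate $\hat\beta_{n\lambda}$ with $\lambda$ satisfying (44) achieves the
optimal rate.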

Finally, we note that although we have focused on the squared error loss
here, the method of regularization can be easily extended to handle other
goodness of fit measures as well as the generalized functional linear
regression [Cardot and Sarda (2005) and Müller and Stadtmüller (2005)]. We
shall leave these extensions for future studies.






5. Numerical results. The Representer Theorem given in Section 2 makes
the regularized estimators easy to implement. Similarly to smoothness
regularized estimators in other contexts [see, e.g., Wahba (1990)], $\hat\eta_{n\lambda}$
and $\hat\beta_{n\lambda}$ can be expressed as a linear combination of a finite number of
known basis functions although the minimization in (3) is taken over an
infinite-dimensional space. Existing algorithms for smoothing splines can
thus be used to compute our regularized estimators $\hat\eta_{n\lambda}$, $\hat\beta_{n\lambda}$ and $\hat\alpha_{n\lambda}$.

To demonstrate the merits of the proposed estimators in finite sample
settings, we carried out a set of simulation studies. We adopt the
simulation setting of Hall and Horowitz (2007) where $\mathcal{T} = [0, 1]$. The true
slope function $\beta_0$ is given by

(55)  $\beta_0 = \sum_{k=1}^{50} 4(-1)^{k+1} k^{-2} \phi_k,$

where $\phi_1(t) = 1$ and $\phi_{k+1}(t) = \sqrt{2}\cos(k\pi t)$ for $k \ge 1$. The random function
$X$ was generated as

(56)  $X = \sum_{k=1}^{50} \zeta_k Z_k \phi_k,$

where $Z_k$ are independently sampled from the uniform distribution on
$[-\sqrt{3}, \sqrt{3}]$ and $\zeta_k$ are deterministic. It is not hard to see that $\zeta_k^2$ are
the eigenvalues of the covariance function of $X$. Following Hall and
Horowitz (2007), two sets of $\zeta_k$ were used. In the first set, the
eigenvalues are well spaced: $\zeta_k = (-1)^{k+1} k^{-\nu/2}$ with $\nu = 1.1, 1.5, 2$ or 4.
In the second set,

(57)  $\zeta_k = \begin{cases} 1, & k = 1, \\ 0.2(-1)^{k+1}(1 - 0.0001k), & 2 \le k \le 4, \\ 0.2(-1)^{k+1}[(5\lfloor k/5 \rfloor)^{-\nu/2} - 0.0001(k \bmod 5)], & k \ge 5. \end{cases}$

As in Hall and Horowitz (2007), regression models with $\epsilon \sim N(0, \sigma^2)$ where
$\sigma = 0.5$ and 1 were considered. To comprehend the effect of sample size, we
consider $n = 50, 100, 200$ and 500. We apply the regularization method to
each simulated dataset and examine its estimation accuracy as measured by
the integrated squared error $\|\hat\beta_{n\lambda} - \beta_0\|_{\mathcal{L}_2}^2$ and the prediction error
$\|\hat\beta_{n\lambda} - \beta_0\|_0^2$. For the purpose of illustration, we take $\mathcal{H} = \mathcal{W}_2^2$ and
$J(\beta) = \int (\beta'')^2$, for which the detailed estimation procedure is given in
Section 2. For each setting, the experiment was repeated 1000 times.
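The data-generating mechanism (55)-(56) with well-spaced eigenvalues can be sketched as follows. This is an illustrative sketch only: the curves are evaluated on a user-supplied grid, the intercept is set to zero, and the names `cosine_basis` and `simulate` are hypothetical.

```python
import numpy as np

def cosine_basis(grid, K=50):
    """phi_1(t) = 1 and phi_{k+1}(t) = sqrt(2) cos(k pi t), rows indexed by k."""
    phi = np.ones((K, grid.size))
    k = np.arange(1, K)
    phi[1:] = np.sqrt(2.0) * np.cos(np.pi * k[:, None] * grid[None, :])
    return phi

def simulate(n, nu, sigma, grid, rng, K=50):
    """Draw (x, y, beta0) following (55)-(56) with zeta_k = (-1)^{k+1} k^{-nu/2}.

    Assumption of this sketch: the intercept alpha_0 is zero.
    """
    phi = cosine_basis(grid, K)
    k = np.arange(1, K + 1)
    beta_coef = 4.0 * (-1.0) ** (k + 1) / k ** 2
    zeta = (-1.0) ** (k + 1) * k ** (-nu / 2.0)
    Z = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, K))
    scores = Z * zeta                       # coefficients of X on the basis
    x = scores @ phi                        # curves on the grid, shape (n, m)
    beta0 = beta_coef @ phi                 # true slope on the grid
    # By orthonormality of the basis, int X beta0 = sum_k scores_k * beta_coef_k.
    y = scores @ beta_coef + sigma * rng.standard_normal(n)
    return x, y, beta0
```

A typical call would be `simulate(100, 1.1, 0.5, grid, np.random.default_rng(0))` with `grid = (np.arange(200) + 0.5) / 200`.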

Fig. 1. Prediction errors of the regularized estimator ($\sigma = 0.5$): $X$ was simulated with a
covariance function with well-spaced eigenvalues. The results are averaged over 1000 runs.
Black solid lines, red dashed lines, green dotted lines and blue dash-dotted lines correspond
to $\nu = 1.1$, $1.5$, $2$ and $4$, respectively. Both axes are in log scale.

As is common in most smoothing methods, the choice of the tuning parameter
plays an important role in the performance of the regularized estimators.
Data-driven choice of the tuning parameter is a difficult problem.
Here we apply the commonly used practical strategy of empirically
choosing the value of $\lambda$ through generalized cross validation. Note that
the regularized estimator is a linear estimator in that $\hat{y} = H(\lambda) y$ where
$\hat{y} = (\hat\eta_{n\lambda}(x_1), \ldots, \hat\eta_{n\lambda}(x_n))'$ and $H(\lambda)$ is the so-called
hat matrix depending on $\lambda$. We then select the tuning parameter $\lambda$ that minimizes

$$ \mathrm{GCV}(\lambda) = \frac{(1/n)\|\hat{y} - y\|_{\ell_2}^2}{(1 - \operatorname{tr}(H(\lambda))/n)^2}. \tag{58} $$

Denote by $\lambda^{\mathrm{GCV}}$ the resulting choice of the tuning parameter.
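The GCV criterion (58) only needs the response vector and the hat matrix, so it can be sketched independently of how the smoother is built. In the illustration below, `gcv_score` follows (58) directly, while `ridge_hat_matrix` is a hypothetical stand-in (a plain ridge-type linear smoother on a generic design matrix) used here only to have a concrete $H(\lambda)$. It is not the hat matrix of the estimator from Section 2.

```python
import numpy as np

def gcv_score(y, hat_matrix):
    """GCV(lambda) = (1/n) ||H y - y||^2 / (1 - tr(H)/n)^2, as in (58)."""
    n = y.shape[0]
    resid = hat_matrix @ y - y
    return (resid @ resid / n) / (1.0 - np.trace(hat_matrix) / n) ** 2

def ridge_hat_matrix(design, lam):
    """Hat matrix of a generic ridge-type smoother (illustrative stand-in only)."""
    n, p = design.shape
    gram = design.T @ design + n * lam * np.eye(p)
    return design @ np.linalg.solve(gram, design.T)

def select_lambda_gcv(y, design, lambdas):
    """Return the candidate lambda with the smallest GCV score, plus all scores."""
    scores = np.array([gcv_score(y, ridge_hat_matrix(design, lam)) for lam in lambdas])
    return lambdas[int(np.argmin(scores))], scores

# Small synthetic usage example (data invented for illustration).
rng = np.random.default_rng(0)
design = rng.standard_normal((40, 5))
y = design @ np.ones(5) + 0.1 * rng.standard_normal(40)
lambdas = np.array([1e-4, 1e-2, 1.0, 100.0])
lam_gcv, scores = select_lambda_gcv(y, design, lambdas)
```

In practice the candidate grid for $\lambda$ is usually taken on a logarithmic scale, since the criterion tends to vary slowly on the original scale.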

We begin with the setting of well-spaced eigenvalues. The left panel of
Figure 1 shows the prediction error, $\|\hat\beta_{n\lambda} - \beta_0\|_0^2$, for each combination of
$\nu$ value and sample size when $\sigma = 0.5$. The results were averaged over 1000
simulation runs in each setting. Both axes are given in the log scale. The
plot suggests that the prediction error converges at a polynomial rate as
sample size $n$ increases, which agrees with our theoretical results from the
previous section. Furthermore, one can observe that with the same sample
size, the prediction error tends to be smaller for larger $\nu$. This also confirms
our theoretical development which indicates that the faster the eigenvalues
of the covariance operator for $X(\cdot)$ decay, the smaller the prediction
error.

Fig. 2. Estimation errors of the regularized estimator ($\sigma = 0.5$): $X$ was simulated with a
covariance function with well-spaced eigenvalues. The results are averaged over 1000 runs.
Black solid lines, red dashed lines, green dotted lines and blue dash-dotted lines correspond
to $\nu = 1.1$, $1.5$, $2$ and $4$, respectively. Both axes are in log scale.

To better understand the performance of the smoothness regularized
estimator and the GCV choice of the tuning parameter, we also recorded the
performance of an oracle estimator whose tuning parameter is chosen to
minimize the prediction error. This choice of the tuning parameter ensures
the optimal performance of the regularized estimator. It is, however,
noteworthy that this is not a legitimate statistical estimator since it depends on
knowledge of the unknown slope function $\beta_0$. The right panel of Figure 1
shows the prediction error associated with this choice of tuning parameter.
It behaves similarly to the estimate with $\lambda$ chosen by GCV. Note that the
comparison between the two panels suggests that GCV generally leads to
near-optimal performance.

We now turn to the estimation error. Figure 2 shows the estimation errors,
averaged over 1000 simulation runs, with $\lambda$ chosen by GCV or by minimizing the
estimation error for each combination of sample size and $\nu$ value. Similarly
to the prediction error, the plots suggest a polynomial rate of convergence
of the estimation error when the sample size increases, and GCV again leads
to a near-optimal choice of the tuning parameter.

Fig. 3. Estimation and prediction errors of the regularized estimator ($\sigma^2 = 1^2$), both panels
tuned with GCV: $X$ was simulated with a covariance function with well-spaced eigenvalues.
The results are averaged over 1000 runs. Black solid lines, red dashed lines, green dotted
lines and blue dash-dotted lines correspond to $\nu = 1.1$, $1.5$, $2$ and $4$, respectively.
Both axes are in log scale.

A comparison between Figures 1 and 2 suggests that when $X$ is smoother
(larger $\nu$), prediction (as measured by the prediction error) is easier, but
estimation (as measured by the estimation error) tends to be harder, which
highlights the difference between prediction and estimation in functional
linear regression. We also note that this observation is in agreement with
our theoretical results from the previous section where it is shown that
the estimation error decreases at the rate of $n^{-2r/(2(r+s)+1)}$, which decelerates
as $s$ increases, whereas the prediction error decreases at the rate of
$n^{-2(r+s)/(2(r+s)+1)}$, which accelerates as $s$ increases.
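The opposite effect of $s$ on the two rates is easy to see numerically. The helper names below are hypothetical and the snippet simply evaluates the two exponents quoted above for a fixed $r$ and increasing $s$. A larger exponent means faster convergence.

```python
def estimation_exponent(r, s):
    """Exponent in the estimation rate n^{-2r / (2(r+s)+1)}."""
    return 2.0 * r / (2.0 * (r + s) + 1.0)

def prediction_exponent(r, s):
    """Exponent in the prediction rate n^{-2(r+s) / (2(r+s)+1)}."""
    return 2.0 * (r + s) / (2.0 * (r + s) + 1.0)

# For fixed r, tabulate both exponents as s grows.
r = 2.0
table = [(s, estimation_exponent(r, s), prediction_exponent(r, s))
         for s in (0.5, 1.0, 2.0, 4.0)]
for s, est, pred in table:
    print(f"s={s:>3}: estimation exponent {est:.3f}, prediction exponent {pred:.3f}")
```

The estimation exponent shrinks toward zero as $s$ grows, while the prediction exponent approaches one, which is the parametric rate.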

Figure 3 reports the prediction and estimation errors when tuned with
GCV for the large noise ($\sigma = 1$) setting. Observations similar to those for the
small noise setting can also be made. Furthermore, notice that the prediction
errors are much smaller than the estimation errors, which confirms our finding
from the previous section that prediction is an easier problem in the context
of functional linear regression.

The numerical results in the setting with closely spaced eigenvalues are
qualitatively similar to those in the setting with well-spaced eigenvalues.
Figure 4 summarizes the results obtained for the setting with closely spaced
eigenvalues.

Fig. 4. Estimation and prediction errors of the regularized estimator: $X$ was simulated
with a covariance function with closely-spaced eigenvalues. The results are averaged over
1000 runs. Both axes are in log scale. Note that y-axes are of different scales across panels.

We also note that the performance of the regularization estimate with
$\lambda$ tuned with GCV compares favorably with that of the FPCA-based methods
in Hall and Horowitz (2007), even though their results are obtained
with optimal rather than data-driven choices of the tuning parameters.



6. Proofs. 

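6.1. Proof of Proposition 2. Observe that

$$ \int_{\mathcal{T}\times\mathcal{T}} f(s)C(s,t)f(t)\,ds\,dt \le \mu_1\|f\|_{L_2}^2 \le c_1\|f\|_{\mathcal{H}}^2 \tag{59} $$

for some constant $c_1 > 0$. Together with the fact that $J(f) \le \|f\|_{\mathcal{H}}^2$, we
conclude that

$$ \|f\|_R^2 = \int_{\mathcal{T}\times\mathcal{T}} f(s)C(s,t)f(t)\,ds\,dt + J(f) \le (c_1 + 1)\|f\|_{\mathcal{H}}^2. \tag{60} $$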
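
Recall that $\xi_k$, $k = 1, \ldots, N$, are the orthonormal basis of $\mathcal{H}_0$. Under the
assumption of the proposition, the matrix $\Sigma = (\langle C\xi_j, \xi_k\rangle_{L_2})_{1\le j,k\le N}$ is a
positive definite matrix. Denote by $\mu'_1 \ge \mu'_2 \ge \cdots \ge \mu'_N > 0$ its eigenvalues. It is
clear that for any $f_0 \in \mathcal{H}_0$

$$ \|f_0\|_R^2 \ge \mu'_N \|f_0\|_{\mathcal{H}}^2. \tag{61} $$

Note also that for any $f_1 \in \mathcal{H}_1$,

$$ \|f_1\|_{\mathcal{H}}^2 = J(f_1) \le \|f_1\|_R^2. \tag{62} $$

For any $f \in \mathcal{H}$, we can write $f := f_0 + f_1$ where $f_0 \in \mathcal{H}_0$ and $f_1 \in \mathcal{H}_1$.
Then

$$ \|f\|_R^2 = \int_{\mathcal{T}\times\mathcal{T}} f(s)C(s,t)f(t)\,ds\,dt + \|f_1\|_{\mathcal{H}}^2. \tag{63} $$

Recall that

$$ \|f_1\|_{\mathcal{H}}^2 \ge \rho_1^{-1}\|f_1\|_{L_2}^2 \ge \rho_1^{-1}\mu_1^{-1}\int_{\mathcal{T}\times\mathcal{T}} f_1(s)C(s,t)f_1(t)\,ds\,dt. \tag{64} $$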

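For brevity, assume that $\rho_1 = \mu_1 = 1$ without loss of generality. By the
Cauchy-Schwarz inequality,

$$ \begin{aligned}
\|f\|_R^2 &\ge \frac{1}{2}\int_{\mathcal{T}\times\mathcal{T}} f(s)C(s,t)f(t)\,ds\,dt + \|f_1\|_{\mathcal{H}}^2 \\
&\ge \frac{1}{2}\int_{\mathcal{T}\times\mathcal{T}} f_0(s)C(s,t)f_0(t)\,ds\,dt + \frac{3}{2}\int_{\mathcal{T}\times\mathcal{T}} f_1(s)C(s,t)f_1(t)\,ds\,dt \\
&\qquad - \Big(\int_{\mathcal{T}\times\mathcal{T}} f_0(s)C(s,t)f_0(t)\,ds\,dt\Big)^{1/2}\Big(\int_{\mathcal{T}\times\mathcal{T}} f_1(s)C(s,t)f_1(t)\,ds\,dt\Big)^{1/2} \\
&\ge \frac{1}{3}\int_{\mathcal{T}\times\mathcal{T}} f_0(s)C(s,t)f_0(t)\,ds\,dt,
\end{aligned} $$

where we used the fact that $3a^2/2 - ab \ge -b^2/6$ in deriving the last inequality.
Therefore,

$$ \frac{3}{\mu'_N}\|f\|_R^2 \ge \|f_0\|_{\mathcal{H}}^2. \tag{65} $$

Together with the facts that $\|f\|_{\mathcal{H}}^2 = \|f_0\|_{\mathcal{H}}^2 + \|f_1\|_{\mathcal{H}}^2$ and

$$ \|f\|_R^2 \ge J(f_1) = \|f_1\|_{\mathcal{H}}^2, \tag{66} $$

we conclude that

$$ \|f\|_R^2 \ge (1 + 3/\mu'_N)^{-1}\|f\|_{\mathcal{H}}^2. \tag{67} $$

The proof is now complete.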
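The elementary inequality invoked in the last step of the display can be checked by completing the square. The short derivation below is a supplementary worked step, with $a$ and $b$ standing for $(\int f_1 C f_1)^{1/2}$ and $(\int f_0 C f_0)^{1/2}$ respectively:

```latex
\frac{3}{2}a^2 - ab + \frac{1}{6}b^2
  = \frac{3}{2}\Big(a - \frac{b}{3}\Big)^2 \ge 0
\quad\Longrightarrow\quad
\frac{3}{2}a^2 - ab \ge -\frac{1}{6}b^2,
\qquad\text{hence}\qquad
\frac{1}{2}b^2 + \frac{3}{2}a^2 - ab \ge \frac{1}{3}b^2 .
```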
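
6.2. Proof of Theorem 3. First note that

$$ \begin{aligned}
R^{-1/2} f &= \sum_{k=1}^\infty \langle R^{-1/2} f, \zeta_k\rangle_{L_2}\,\zeta_k
  = \sum_{k=1}^\infty \langle R^{-1/2} f, \nu_k^{1/2} R^{-1/2}\omega_k\rangle_{L_2}\,\nu_k^{1/2} R^{-1/2}\omega_k \\
&= R^{-1/2}\Big(\sum_{k=1}^\infty \nu_k \langle R^{-1/2} f, R^{-1/2}\omega_k\rangle_{L_2}\,\omega_k\Big)
  = R^{-1/2}\Big(\sum_{k=1}^\infty \nu_k \langle f, \omega_k\rangle_R\,\omega_k\Big).
\end{aligned} $$

Applying the bounded positive definite operator $R^{1/2}$ to both sides leads to

$$ f = \sum_{k=1}^\infty \nu_k \langle f, \omega_k\rangle_R\,\omega_k. \tag{68} $$

Recall that $\langle \omega_k, \omega_j\rangle_R = \nu_k^{-1}\delta_{kj}$. Therefore,

$$ \begin{aligned}
\|f\|_R^2 &= \Big\langle \sum_{k=1}^\infty \nu_k \langle f, \omega_k\rangle_R\,\omega_k,\ \sum_{j=1}^\infty \nu_j \langle f, \omega_j\rangle_R\,\omega_j \Big\rangle_R \\
&= \sum_{k,j=1}^\infty \nu_k\nu_j \langle f, \omega_k\rangle_R \langle f, \omega_j\rangle_R \langle \omega_k, \omega_j\rangle_R \\
&= \sum_{k=1}^\infty \nu_k \langle f, \omega_k\rangle_R^2.
\end{aligned} $$

Similarly, because $\langle C\omega_k, \omega_j\rangle_{L_2} = \delta_{kj}$,

$$ \begin{aligned}
\langle Cf, f\rangle_{L_2} &= \Big\langle C\Big(\sum_{k=1}^\infty \nu_k \langle f, \omega_k\rangle_R\,\omega_k\Big),\ \sum_{j=1}^\infty \nu_j \langle f, \omega_j\rangle_R\,\omega_j \Big\rangle_{L_2} \\
&= \Big\langle \sum_{k=1}^\infty \nu_k \langle f, \omega_k\rangle_R\,C\omega_k,\ \sum_{j=1}^\infty \nu_j \langle f, \omega_j\rangle_R\,\omega_j \Big\rangle_{L_2} \\
&= \sum_{k,j=1}^\infty \nu_k\nu_j \langle f, \omega_k\rangle_R \langle f, \omega_j\rangle_R \langle C\omega_k, \omega_j\rangle_{L_2} \\
&= \sum_{k=1}^\infty \nu_k^2 \langle f, \omega_k\rangle_R^2.
\end{aligned} $$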
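A finite-dimensional analogue makes the identities of this proof easy to check numerically. The sketch below replaces the operators by positive definite matrices and assumes, for illustration only, that the null space of the penalty is trivial, so that the $R$-inner product is $f^\top (C + K^{-1}) g$. The matrices, dimensions and variable names are invented for the example.

```python
import numpy as np

def sym_sqrt(a):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(w)) @ v.T

rng = np.random.default_rng(0)
p = 6
a = rng.standard_normal((p, p))
b = rng.standard_normal((p, p))
k_mat = a @ a.T + p * np.eye(p)          # stand-in for the reproducing kernel K
c_mat = b @ b.T + np.eye(p)              # stand-in for the covariance operator C
r_inv = c_mat + np.linalg.inv(k_mat)     # <f, g>_R = f' (C + K^{-1}) g
r_mat = np.linalg.inv(r_inv)

# Spectral decomposition of R^{1/2} C R^{1/2} and omega_k = nu_k^{-1/2} R^{1/2} zeta_k.
r_half = sym_sqrt(r_mat)
nu, zeta = np.linalg.eigh(r_half @ c_mat @ r_half)
omega = (r_half @ zeta) / np.sqrt(nu)    # columns are omega_k

f = rng.standard_normal(p)
inner_r = omega.T @ r_inv @ f            # <f, omega_k>_R
f_k = nu * inner_r                       # expansion coefficients in (68)
gamma = 1.0 / (1.0 / nu - 1.0)

assert np.allclose(omega.T @ r_inv @ omega, np.diag(1.0 / nu))   # <omega_j, omega_k>_R
assert np.allclose(omega.T @ c_mat @ omega, np.eye(p))           # <C omega_j, omega_k>
assert np.allclose(omega @ f_k, f)                               # expansion (68)
assert np.isclose(f @ r_inv @ f, np.sum(nu * inner_r ** 2))      # ||f||_R^2
assert np.isclose(f @ c_mat @ f, np.sum(nu ** 2 * inner_r ** 2)) # <Cf, f>
```

With $f_k = \nu_k\langle f, \omega_k\rangle_R$, the same arrays also give $f^\top K^{-1} f = \sum_k \gamma_k^{-1} f_k^2$, where $\gamma_k = (\nu_k^{-1} - 1)^{-1}$.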

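6.3. Proof of Proposition 4. Recall that for any $f \in \mathcal{H}_0$, $Cf = 0$ if and
only if $f = 0$, which implies that $\mathcal{H}_0 \cap \mathrm{l.s.}\{\phi_k : k \ge 1\}^\perp = \{0\}$. Together with
the fact that $\mathcal{H}_0 \cap \mathcal{H}_1 = \{0\}$, we conclude that $\mathcal{H} = \mathcal{H}_1 = \mathrm{l.s.}\{\phi_k : k \ge 1\}$. It
is not hard to see that for any $f, g \in \mathcal{H}$,

$$ \langle f, g\rangle_R = \int_{\mathcal{T}\times\mathcal{T}} f(s)C(s,t)g(t)\,ds\,dt + \langle f, g\rangle_K. \tag{69} $$

In particular,

$$ \langle \psi_j, \psi_k\rangle_R = (\mu_k + \rho_k^{-1})\delta_{jk}, \tag{70} $$

which implies that $\{((\mu_k + \rho_k^{-1})^{-1}, \psi_k) : k \ge 1\}$ is also the eigen system of $R$,
that is,

$$ R(s,t) = \sum_{k=1}^\infty (\mu_k + \rho_k^{-1})^{-1}\psi_k(s)\psi_k(t). \tag{71} $$

Then

$$ R\psi_k := \int_{\mathcal{T}} R(\cdot,t)\psi_k(t)\,dt = (\mu_k + \rho_k^{-1})^{-1}\psi_k, \qquad k = 1, 2, \ldots. \tag{72} $$

Therefore,

$$ \begin{aligned}
R^{1/2}CR^{1/2}\psi_k &= R^{1/2}C\big((\mu_k + \rho_k^{-1})^{-1/2}\psi_k\big) \\
&= R^{1/2}\big(\mu_k(\mu_k + \rho_k^{-1})^{-1/2}\psi_k\big) = (1 + \rho_k^{-1}\mu_k^{-1})^{-1}\psi_k,
\end{aligned} $$

which implies that $\zeta_k = \psi_k = \phi_k$, $\nu_k = (1 + \rho_k^{-1}\mu_k^{-1})^{-1}$ and $\gamma_k = \rho_k\mu_k$.
Consequently,

$$ \omega_k = \nu_k^{-1/2}R^{1/2}\psi_k = \nu_k^{-1/2}(\mu_k + \rho_k^{-1})^{-1/2}\psi_k = \mu_k^{-1/2}\phi_k. \tag{73} $$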
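The aligned case treated in this proof can also be verified with matrices that share eigenvectors. The sketch below is illustrative only: the dimension, the decay rates chosen for $\mu_k$ and $\rho_k$, and all variable names are assumptions made for the example, and the null space of the penalty is again taken to be trivial.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
k = np.arange(1, p + 1, dtype=float)
mu = k ** -2.0                                    # eigenvalues of C (illustrative)
rho = k ** -4.0                                   # eigenvalues of K (illustrative)
phi, _ = np.linalg.qr(rng.standard_normal((p, p)))  # shared orthonormal eigenvectors

c_mat = (phi * mu) @ phi.T
k_inv = (phi / rho) @ phi.T
r_mat = np.linalg.inv(c_mat + k_inv)              # R has eigenvalues (mu + 1/rho)^{-1}

w, v = np.linalg.eigh(r_mat)
r_half = (v * np.sqrt(w)) @ v.T
nu = np.sort(np.linalg.eigvalsh(r_half @ c_mat @ r_half))[::-1]
gamma = 1.0 / (1.0 / nu - 1.0)

nu_expected = 1.0 / (1.0 + 1.0 / (rho * mu))      # (1 + rho^{-1} mu^{-1})^{-1}
print(np.allclose(nu, nu_expected, rtol=1e-6, atol=0.0))
print(np.allclose(gamma, rho * mu, rtol=1e-6, atol=0.0))
```

Both comparisons should print `True` up to floating-point error, matching $\nu_k = (1 + \rho_k^{-1}\mu_k^{-1})^{-1}$ and $\gamma_k = \rho_k\mu_k$.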

6.4. Proof of Theorem 5. Recall that $\mathcal{H} = W_2^m$, which implies that
$\rho_k \asymp k^{-2m}$. By Corollary 2 of Ritter, Wasilkowski and Wozniakowski (1995),
$\mu_k \asymp k^{-2(s+1)}$. It therefore suffices to show $\gamma_k \asymp k^{-2(s+1+m)}$. The key idea of
the proof is a result from Ritter, Wasilkowski and Wozniakowski (1995)
indicating that the reproducing kernel Hilbert space associated with $C$ differs
from $W_2^{s+1}([0,1])$ only by a finite-dimensional linear space of polynomials.

Denote by $Q_r$ the reproducing kernel for $W_2^r([0,1])$. Observe that
$Q_r^{1/2}(L_2) = W_2^r$ [e.g., Cucker and Smale (2001)]. We begin by quantifying the decay rate
of $\lambda_k(Q_m^{1/2}Q_{s+1}Q_m^{1/2})$. By Sobolev's embedding theorem,
$(Q_{s+1}^{1/2}Q_m^{1/2})(L_2) = Q_{s+1}^{1/2}(W_2^m) = W_2^{m+s+1}$. Therefore,
$Q_m^{1/2}Q_{s+1}Q_m^{1/2}$ is equivalent to $Q_{m+s+1}$.
Denote by $\lambda_k(Q)$ the $k$th largest eigenvalue of a positive definite operator
$Q$. Let $\{h_k : k \ge 1\}$ be the eigenfunctions of $Q_{m+s+1}$, that is,
$Q_{m+s+1}h_k = \lambda_k(Q_{m+s+1})h_k$, $k = 1, 2, \ldots$. Denote by $\mathcal{F}_k$ and $\mathcal{F}_k^\perp$
the linear spaces spanned by $\{h_j : 1 \le j \le k\}$ and $\{h_j : j \ge k+1\}$, respectively.
By the Courant-Fischer-Weyl min-max principle,

$$ \begin{aligned}
\lambda_k(Q_m^{1/2}Q_{s+1}Q_m^{1/2}) &\ge \min_{f\in\mathcal{F}_k} \|Q_{s+1}^{1/2}Q_m^{1/2}f\|_{L_2}^2 / \|f\|_{L_2}^2 \\
&\ge C_1 \min_{f\in\mathcal{F}_k} \|Q_{m+s+1}^{1/2}f\|_{L_2}^2 / \|f\|_{L_2}^2 \\
&\ge C_1 \lambda_k(Q_{m+s+1})
\end{aligned} $$
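
for some constant $C_1 > 0$. On the other hand,

$$ \begin{aligned}
\lambda_k(Q_m^{1/2}Q_{s+1}Q_m^{1/2}) &\le \max_{f\in\mathcal{F}_{k-1}^\perp} \|Q_{s+1}^{1/2}Q_m^{1/2}f\|_{L_2}^2 / \|f\|_{L_2}^2 \\
&\le C_2 \max_{f\in\mathcal{F}_{k-1}^\perp} \|Q_{m+s+1}^{1/2}f\|_{L_2}^2 / \|f\|_{L_2}^2 \\
&\le C_2 \lambda_k(Q_{m+s+1})
\end{aligned} $$

for some constant $C_2 > 0$. In summary, we have
$\lambda_k(Q_m^{1/2}Q_{s+1}Q_m^{1/2}) \asymp \lambda_k(Q_{m+s+1}) \asymp k^{-2(m+s+1)}$.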

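As shown by Ritter, Wasilkowski and Wozniakowski [(1995), Theorem 1,
page 525], there exist $D$ and $U$ such that $Q_{s+1} = D + U$, $D$ has at most
$2(s+1)$ nonzero eigenvalues and $\|U^{1/2}f\|_{L_2}$ is equivalent to $\|C^{1/2}f\|_{L_2}$.
Moreover, the eigenfunctions of $D$, denoted by $\{g_1, \ldots, g_d\}$ ($d \le 2(s+1)$), are
polynomials of order no greater than $2s+1$. Denote by $\mathcal{G}$ the space spanned by
$\{g_1, \ldots, g_d\}$. Clearly $\mathcal{G} \subset W_2^m = Q_m^{1/2}(L_2)$. Denote by
$\{\tilde h_j : j \ge 1\}$ the eigenfunctions of $Q_m^{1/2}Q_{s+1}Q_m^{1/2}$. Let $\tilde{\mathcal{F}}_k$ and
$\tilde{\mathcal{F}}_k^\perp$ be defined similarly as $\mathcal{F}_k$ and $\mathcal{F}_k^\perp$.
Then by the Courant-Fischer-Weyl min-max principle,

$$ \begin{aligned}
\lambda_{k-d}(Q_m^{1/2}UQ_m^{1/2}) &\ge \min_{f\in\tilde{\mathcal{F}}_k\cap Q_m^{-1/2}(\mathcal{G})^\perp} \|U^{1/2}Q_m^{1/2}f\|_{L_2}^2 / \|f\|_{L_2}^2 \\
&= \min_{f\in\tilde{\mathcal{F}}_k\cap Q_m^{-1/2}(\mathcal{G})^\perp} \|Q_{s+1}^{1/2}Q_m^{1/2}f\|_{L_2}^2 / \|f\|_{L_2}^2 \\
&\ge \min_{f\in\tilde{\mathcal{F}}_k} \|Q_{s+1}^{1/2}Q_m^{1/2}f\|_{L_2}^2 / \|f\|_{L_2}^2 \\
&\ge C_1 \lambda_k(Q_{m+s+1})
\end{aligned} $$

for some constant $C_1 > 0$. On the other hand,

$$ \begin{aligned}
\lambda_{k+d}(Q_m^{1/2}UQ_m^{1/2}) &\le \max_{f\in\tilde{\mathcal{F}}_{k-1}^\perp\cap Q_m^{-1/2}(\mathcal{G})^\perp} \|U^{1/2}Q_m^{1/2}f\|_{L_2}^2 / \|f\|_{L_2}^2 \\
&= \max_{f\in\tilde{\mathcal{F}}_{k-1}^\perp\cap Q_m^{-1/2}(\mathcal{G})^\perp} \|Q_{s+1}^{1/2}Q_m^{1/2}f\|_{L_2}^2 / \|f\|_{L_2}^2 \\
&\le \max_{f\in\tilde{\mathcal{F}}_{k-1}^\perp} \|Q_{s+1}^{1/2}Q_m^{1/2}f\|_{L_2}^2 / \|f\|_{L_2}^2 \\
&\le C_2 \lambda_k(Q_{m+s+1})
\end{aligned} $$

for some constant $C_2 > 0$. Hence $\lambda_k(Q_m^{1/2}UQ_m^{1/2}) \asymp k^{-2(m+s+1)}$.

Because $Q_m^{1/2}UQ_m^{1/2}$ is equivalent to $R^{1/2}CR^{1/2}$, following a similar argument
as before, by the Courant-Fischer-Weyl min-max principle, we complete
the proof.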
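
6.5. Proof of Theorem 6. We now proceed to prove Theorem 6. The
analysis follows a similar spirit as the technique commonly used in the study
of the rate of convergence of smoothing splines [see, e.g., Silverman (1982);
Cox and O'Sullivan (1990)]. For brevity, we shall assume that $EX(\cdot) = 0$ in
the rest of the proof. In this case, $\alpha_0$ can be estimated by $\bar{y}$ and $\beta_0$ by

$$ \hat\beta_{n\lambda} = \operatorname*{argmin}_{\beta\in\mathcal{H}} \Big[\frac{1}{n}\sum_{i=1}^n \Big(y_i - \int_{\mathcal{T}} x_i(t)\beta(t)\,dt\Big)^2 + \lambda J(\beta)\Big]. \tag{74} $$

The proof below also applies to the more general setting when $EX(\cdot) \ne 0$
but with considerable technical obscurity.
Recall that

$$ \ell_n(\beta) = \frac{1}{n}\sum_{i=1}^n \Big(y_i - \int_{\mathcal{T}} x_i(t)\beta(t)\,dt\Big)^2. \tag{75} $$

Observe that

$$ \begin{aligned}
\ell_\infty(\beta) := E\ell_n(\beta) &= E\Big[Y - \int_{\mathcal{T}} X(t)\beta(t)\,dt\Big]^2 \\
&= \sigma^2 + \int_{\mathcal{T}}\int_{\mathcal{T}} [\beta(s) - \beta_0(s)]C(s,t)[\beta(t) - \beta_0(t)]\,ds\,dt \\
&= \sigma^2 + \|\beta - \beta_0\|_0^2.
\end{aligned} $$

Write

$$ \bar\beta_{\infty\lambda} = \operatorname*{argmin}_{\beta\in\mathcal{H}} \{\ell_\infty(\beta) + \lambda J(\beta)\}. \tag{76} $$

Clearly

$$ \hat\beta_{n\lambda} - \beta_0 = (\hat\beta_{n\lambda} - \bar\beta_{\infty\lambda}) + (\bar\beta_{\infty\lambda} - \beta_0). \tag{77} $$

We refer to the two terms on the right-hand side as the stochastic error and the
deterministic error, respectively.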

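6.5.1. Deterministic error. Write $\beta_0(\cdot) = \sum_{k=1}^\infty a_k\omega_k(\cdot)$ and
$\beta(\cdot) = \sum_{k=1}^\infty b_k\omega_k(\cdot)$. Then Theorem 3 implies that

$$ \ell_\infty(\beta) = \sigma^2 + \sum_{k=1}^\infty (b_k - a_k)^2, \qquad J(\beta) = \sum_{k=1}^\infty \gamma_k^{-1}b_k^2. $$

Therefore,

$$ \bar\beta_{\infty\lambda}(\cdot) = \sum_{k=1}^\infty \frac{a_k}{1 + \lambda\gamma_k^{-1}}\,\omega_k(\cdot) =: \sum_{k=1}^\infty \bar b_k\omega_k(\cdot). \tag{78} $$

It can then be computed that for any $a < 1$,

$$ \begin{aligned}
\|\bar\beta_{\infty\lambda} - \beta_0\|_a^2 &= \sum_{k=1}^\infty (1 + \gamma_k^{-a})(\bar b_k - a_k)^2 \\
&= \sum_{k=1}^\infty (1 + \gamma_k^{-a})\Big(\frac{\lambda\gamma_k^{-1}}{1 + \lambda\gamma_k^{-1}}\Big)^2 a_k^2 \\
&\le \lambda^2 \sup_k \frac{(1 + \gamma_k^{-a})\gamma_k^{-1}}{(1 + \lambda\gamma_k^{-1})^2} \sum_{k=1}^\infty \gamma_k^{-1}a_k^2 \\
&= \lambda^2 J(\beta_0) \sup_k \frac{(1 + \gamma_k^{-a})\gamma_k^{-1}}{(1 + \lambda\gamma_k^{-1})^2}.
\end{aligned} $$

Now note that

$$ \begin{aligned}
\sup_k \frac{(1 + \gamma_k^{-a})\gamma_k^{-1}}{(1 + \lambda\gamma_k^{-1})^2}
&\le \sup_{x>0} \frac{(1 + x^{-a})x^{-1}}{(1 + \lambda x^{-1})^2} \\
&\le \sup_{x>0} \frac{x^{-1}}{(1 + \lambda x^{-1})^2} + \sup_{x>0} \frac{x^{-(a+1)}}{(1 + \lambda x^{-1})^2} \\
&= \frac{1}{\inf_{x>0}(x^{1/2} + \lambda x^{-1/2})^2} + \frac{1}{\inf_{x>0}(x^{(a+1)/2} + \lambda x^{-(1-a)/2})^2} \\
&= \frac{1}{4\lambda} + C_0\lambda^{-(a+1)}.
\end{aligned} $$

Hereafter, we use $C_0$ to denote a generic positive constant. In summary, we
have

Lemma 12. If $\lambda$ is bounded from above, then

$$ \|\bar\beta_{\infty\lambda} - \beta_0\|_a^2 = O(\lambda^{1-a}J(\beta_0)). $$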
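The bound behind Lemma 12 can be checked numerically once the coefficient sequences are fixed. The sketch below is an illustration under assumptions made here: the sequences $\gamma_k = k^{-4}$ and $a_k = k^{-3}$, the truncation at 2000 terms and the function names are all hypothetical. The constant $C_0$ is computed in closed form from the minimizer $t^\ast = (1-a)/(1+a)$ of $t^{(1+a)/2} + t^{-(1-a)/2}$, which follows from a first-order condition.

```python
import numpy as np

def bias_sq_norm(a_coef, gamma, lam, a):
    """||beta_bar - beta_0||_a^2 with b_bar_k = a_k / (1 + lam / gamma_k)."""
    shrink = (lam / gamma) / (1.0 + lam / gamma)
    return np.sum((1.0 + gamma ** (-a)) * shrink ** 2 * a_coef ** 2)

def bias_bound(a_coef, gamma, lam, a):
    """lam^2 J(beta_0) [1/(4 lam) + C0 lam^{-(1+a)}], assuming a < 1."""
    penalty = np.sum(a_coef ** 2 / gamma)              # J(beta_0)
    p, q = (1.0 + a) / 2.0, (1.0 - a) / 2.0
    t_star = q / p                                     # minimizer of t^p + t^{-q}
    c0 = 1.0 / (t_star ** p + t_star ** (-q)) ** 2
    return lam ** 2 * penalty * (1.0 / (4.0 * lam) + c0 * lam ** (-(1.0 + a)))

k = np.arange(1, 2001, dtype=float)
gamma = k ** -4.0        # illustrative decay gamma_k = k^{-2(r+s)} with r + s = 2
a_coef = k ** -3.0       # illustrative coefficients with finite J(beta_0)

for a in (0.0, 0.5):
    for lam in (1e-4, 1e-2, 1.0):
        lhs = bias_sq_norm(a_coef, gamma, lam, a)
        rhs = bias_bound(a_coef, gamma, lam, a)
        print(f"a={a}, lambda={lam:g}: error={lhs:.3e}, bound={rhs:.3e}")
```

For small $\lambda$ the bound is dominated by the $\lambda^{1-a}$ term, which is the rate stated in Lemma 12.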

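6.5.2. Stochastic error. Next, we consider the stochastic error $\hat\beta_{n\lambda} - \bar\beta_{\infty\lambda}$.
Denote

$$ \begin{aligned}
D\ell_n(\beta)f &= -\frac{2}{n}\sum_{i=1}^n \Big[\Big(y_i - \int_{\mathcal{T}} x_i(t)\beta(t)\,dt\Big)\int_{\mathcal{T}} x_i(t)f(t)\,dt\Big], \\
D\ell_\infty(\beta)f &= -2E_X\Big(\int_{\mathcal{T}} X(t)[\beta_0(t) - \beta(t)]\,dt \int_{\mathcal{T}} X(t)f(t)\,dt\Big) \\
&= -2\int_{\mathcal{T}}\int_{\mathcal{T}} [\beta_0(s) - \beta(s)]C(s,t)f(t)\,ds\,dt, \\
D^2\ell_n(\beta)fg &= \frac{2}{n}\sum_{i=1}^n \Big[\int_{\mathcal{T}} x_i(t)f(t)\,dt \int_{\mathcal{T}} x_i(t)g(t)\,dt\Big], \\
D^2\ell_\infty(\beta)fg &= 2\int_{\mathcal{T}}\int_{\mathcal{T}} f(s)C(s,t)g(t)\,ds\,dt.
\end{aligned} $$

Also write $\ell_{n\lambda}(\beta) = \ell_n(\beta) + \lambda J(\beta)$ and $\ell_{\infty\lambda}(\beta) = \ell_\infty(\beta) + \lambda J(\beta)$.
Denote $G_\lambda = (1/2)D^2\ell_{\infty\lambda}(\bar\beta_{\infty\lambda})$ and

$$ \tilde\beta = \bar\beta_{\infty\lambda} - \frac{1}{2}G_\lambda^{-1}D\ell_{n\lambda}(\bar\beta_{\infty\lambda}). \tag{79} $$

It is clear that

$$ \hat\beta_{n\lambda} - \bar\beta_{\infty\lambda} = (\hat\beta_{n\lambda} - \tilde\beta) + (\tilde\beta - \bar\beta_{\infty\lambda}). \tag{80} $$

We now study the two terms on the right-hand side separately. For brevity,
we shall omit the subscripts of $\hat\beta$ and $\bar\beta$ in what follows whenever no
confusion occurs. We begin with $\tilde\beta - \bar\beta$.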



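Lemma 13. For any $0 \le a \le 1$,

$$ E\|\tilde\beta - \bar\beta\|_a^2 \asymp n^{-1}\lambda^{-(a + 1/(2(r+s)))}. \tag{81} $$

Proof. Notice that $D\ell_{n\lambda}(\bar\beta) = D\ell_{n\lambda}(\bar\beta) - D\ell_{\infty\lambda}(\bar\beta) = D\ell_n(\bar\beta) - D\ell_\infty(\bar\beta)$.
Therefore

$$ \begin{aligned}
E[D\ell_{n\lambda}(\bar\beta)f]^2 &= E[D\ell_n(\bar\beta)f - D\ell_\infty(\bar\beta)f]^2 \\
&= \frac{4}{n}\operatorname{Var}\Big[\Big(Y - \int_{\mathcal{T}} X(t)\bar\beta(t)\,dt\Big)\int_{\mathcal{T}} X(t)f(t)\,dt\Big] \\
&\le \frac{4}{n}E\Big[\Big(Y - \int_{\mathcal{T}} X(t)\bar\beta(t)\,dt\Big)\int_{\mathcal{T}} X(t)f(t)\,dt\Big]^2 \\
&= \frac{4}{n}E\Big(\int_{\mathcal{T}} X(t)[\beta_0(t) - \bar\beta(t)]\,dt \int_{\mathcal{T}} X(t)f(t)\,dt\Big)^2
 + \frac{4}{n}\sigma^2 E\Big(\int_{\mathcal{T}} X(t)f(t)\,dt\Big)^2,
\end{aligned} $$

where we used the fact that $\varepsilon = Y - \int X\beta_0$ is uncorrelated with $X$. To bound
the first term, an application of the Cauchy-Schwarz inequality yields

$$ \begin{aligned}
E\Big(\int_{\mathcal{T}} X(t)[\beta_0(t) - \bar\beta(t)]\,dt \int_{\mathcal{T}} X(t)f(t)\,dt\Big)^2
&\le \Big\{E\Big(\int_{\mathcal{T}} X(t)[\beta_0(t) - \bar\beta(t)]\,dt\Big)^4 E\Big(\int_{\mathcal{T}} X(t)f(t)\,dt\Big)^4\Big\}^{1/2} \\
&\le M\|\beta_0 - \bar\beta\|_0^2\|f\|_0^2,
\end{aligned} $$

where the second inequality holds by the second condition of $\mathcal{F}(s, M, K)$.
Therefore,

$$ E[D\ell_{n\lambda}(\bar\beta)f]^2 \le \frac{4M}{n}\|\beta_0 - \bar\beta\|_0^2\|f\|_0^2 + \frac{4\sigma^2}{n}\|f\|_0^2, \tag{82} $$

which by Lemma 12 is further bounded by $(C_0\sigma^2/n)\|f\|_0^2$ for some positive
constant $C_0$. Recall that $\|\omega_k\|_0 = 1$. We have

$$ E[D\ell_{n\lambda}(\bar\beta)\omega_k]^2 \le C_0\sigma^2/n. \tag{83} $$

Thus, by the definition of $\tilde\beta$,

$$ \begin{aligned}
E\|\tilde\beta - \bar\beta\|_a^2 &= E\Big\|\frac{1}{2}G_\lambda^{-1}D\ell_{n\lambda}(\bar\beta)\Big\|_a^2 \\
&= \frac{1}{4}E\Big[\sum_{k=1}^\infty (1 + \gamma_k^{-a})(1 + \lambda\gamma_k^{-1})^{-2}(D\ell_{n\lambda}(\bar\beta)\omega_k)^2\Big] \\
&\le \frac{C_0\sigma^2}{4n}\sum_{k=1}^\infty (1 + \gamma_k^{-a})(1 + \lambda\gamma_k^{-1})^{-2} \\
&\le \frac{C_0\sigma^2}{4n}\sum_{k=1}^\infty (1 + k^{2a(r+s)})(1 + \lambda k^{2(r+s)})^{-2} \\
&\asymp \frac{C_0\sigma^2}{4n}\int_1^\infty x^{2a(r+s)}(1 + \lambda x^{2(r+s)})^{-2}\,dx \\
&\asymp \frac{C_0\sigma^2}{4n}\int_1^\infty (1 + \lambda x^{2(r+s)/(2a(r+s)+1)})^{-2}\,dx \\
&= \frac{C_0\sigma^2}{4n}\lambda^{-(a + 1/(2(r+s)))}\int_{\lambda^{a + 1/(2(r+s))}}^\infty (1 + x^{2(r+s)/(2a(r+s)+1)})^{-2}\,dx \\
&\asymp n^{-1}\lambda^{-(a + 1/(2(r+s)))}.
\end{aligned} $$

The proof is now complete. □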
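The scaling of the sum that drives this bound can be checked numerically. The sketch below evaluates $\sum_k (1 + k^{2a(r+s)})(1 + \lambda k^{2(r+s)})^{-2}$ for several values of $\lambda$ and rescales it by $\lambda^{a + 1/(2(r+s))}$. The choice $r + s = 2$, $a = 0.5$, the truncation level and the function names are assumptions made for illustration.

```python
import numpy as np

def variance_sum(lam, a, q, n_terms=200_000):
    """Truncated sum of (1 + k^{2aq}) / (1 + lam k^{2q})^2, with q = r + s."""
    k = np.arange(1, n_terms + 1, dtype=float)
    return np.sum((1.0 + k ** (2.0 * a * q)) / (1.0 + lam * k ** (2.0 * q)) ** 2)

def scaled_variance_sum(lam, a, q):
    """Sum multiplied by lam^{a + 1/(2q)}; roughly constant if the rate is right."""
    return variance_sum(lam, a, q) * lam ** (a + 1.0 / (2.0 * q))

lams = (1e-6, 1e-5, 1e-4, 1e-3)
ratios = [scaled_variance_sum(lam, a=0.5, q=2.0) for lam in lams]
for lam, ratio in zip(lams, ratios):
    print(f"lambda={lam:g}: scaled sum = {ratio:.4f}")
```

The rescaled values stay within a narrow band as $\lambda$ varies over three orders of magnitude, which is what the relation $\asymp \lambda^{-(a + 1/(2(r+s)))}$ predicts.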



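Now we are in position to bound $E\|\hat\beta - \tilde\beta\|_a^2$. By definition,

$$ G_\lambda(\hat\beta - \tilde\beta) = \frac{1}{2}D^2\ell_{\infty\lambda}(\bar\beta)(\hat\beta - \tilde\beta). \tag{84} $$

First-order condition implies that

$$ D\ell_{n\lambda}(\hat\beta) = D\ell_{n\lambda}(\bar\beta) + D^2\ell_{n\lambda}(\bar\beta)(\hat\beta - \bar\beta) = 0, \tag{85} $$

where we used the fact that $\ell_{n\lambda}$ is quadratic. Together with the fact that

$$ D\ell_{n\lambda}(\bar\beta) + D^2\ell_{\infty\lambda}(\bar\beta)(\tilde\beta - \bar\beta) = 0, \tag{86} $$

we have

$$ \begin{aligned}
D^2\ell_{\infty\lambda}(\bar\beta)(\hat\beta - \tilde\beta) &= D^2\ell_{\infty\lambda}(\bar\beta)(\hat\beta - \bar\beta) + D^2\ell_{\infty\lambda}(\bar\beta)(\bar\beta - \tilde\beta) \\
&= D^2\ell_{\infty\lambda}(\bar\beta)(\hat\beta - \bar\beta) - D^2\ell_{n\lambda}(\bar\beta)(\hat\beta - \bar\beta) \\
&= D^2\ell_\infty(\bar\beta)(\hat\beta - \bar\beta) - D^2\ell_n(\bar\beta)(\hat\beta - \bar\beta).
\end{aligned} $$

Therefore,

$$ (\hat\beta - \tilde\beta) = \frac{1}{2}G_\lambda^{-1}\big[D^2\ell_\infty(\bar\beta)(\hat\beta - \bar\beta) - D^2\ell_n(\bar\beta)(\hat\beta - \bar\beta)\big]. \tag{87} $$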

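Write

$$ \hat\beta = \sum_{k=1}^\infty \hat b_k\omega_k \qquad\text{and}\qquad \bar\beta = \sum_{k=1}^\infty \bar b_k\omega_k. $$

Then

$$ \begin{aligned}
\|\hat\beta - \tilde\beta\|_a^2 &= \frac{1}{4}\sum_{k=1}^\infty (1 + \lambda\gamma_k^{-1})^{-2}(1 + \gamma_k^{-a}) \\
&\qquad\times \Big[\sum_{j=1}^\infty (\hat b_j - \bar b_j)\int_{\mathcal{T}}\int_{\mathcal{T}} \omega_j(s)\Big(\frac{1}{n}\sum_{i=1}^n x_i(t)x_i(s) - C(s,t)\Big)\omega_k(t)\,ds\,dt\Big]^2 \\
&\le \frac{1}{4}\sum_{k=1}^\infty (1 + \lambda\gamma_k^{-1})^{-2}(1 + \gamma_k^{-a})\Big[\sum_{j=1}^\infty (\hat b_j - \bar b_j)^2(1 + \gamma_j^{-c})\Big] \\
&\qquad\times \Big[\sum_{j=1}^\infty (1 + \gamma_j^{-c})^{-1}\Big(\int_{\mathcal{T}}\int_{\mathcal{T}} \omega_j(s)\Big(\frac{1}{n}\sum_{i=1}^n x_i(t)x_i(s) - C(s,t)\Big)\omega_k(t)\,ds\,dt\Big)^2\Big],
\end{aligned} $$

where the inequality is due to the Cauchy-Schwarz inequality.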
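Note that

$$ \begin{aligned}
&E\Big(\sum_{j=1}^\infty (1 + \gamma_j^{-c})^{-1}\Big[\int_{\mathcal{T}}\int_{\mathcal{T}} \omega_j(s)\Big(\frac{1}{n}\sum_{i=1}^n x_i(t)x_i(s) - C(s,t)\Big)\omega_k(t)\,ds\,dt\Big]^2\Big) \\
&\qquad= \frac{1}{n}\sum_{j=1}^\infty (1 + \gamma_j^{-c})^{-1}\operatorname{Var}\Big(\int_{\mathcal{T}} \omega_j(t)X(t)\,dt \int_{\mathcal{T}} \omega_k(t)X(t)\,dt\Big) \\
&\qquad\le \frac{1}{n}\sum_{j=1}^\infty (1 + \gamma_j^{-c})^{-1} E\Big[\Big(\int_{\mathcal{T}} \omega_j(t)X(t)\,dt\Big)^4\Big]^{1/2} E\Big[\Big(\int_{\mathcal{T}} \omega_k(t)X(t)\,dt\Big)^4\Big]^{1/2} \\
&\qquad\le \frac{M}{n}\sum_{j=1}^\infty (1 + \gamma_j^{-c})^{-1} E\Big[\Big(\int_{\mathcal{T}} \omega_j(t)X(t)\,dt\Big)^2\Big] E\Big[\Big(\int_{\mathcal{T}} \omega_k(t)X(t)\,dt\Big)^2\Big] \\
&\qquad= \frac{M}{n}\sum_{j=1}^\infty (1 + \gamma_j^{-c})^{-1} \asymp n^{-1},
\end{aligned} $$

provided that $c > 1/(2(r+s))$. On the other hand,

$$ \begin{aligned}
\sum_{k=1}^\infty (1 + \lambda\gamma_k^{-1})^{-2}(1 + \gamma_k^{-a})
&\le C_0\sum_{k=1}^\infty (1 + \lambda k^{2(r+s)})^{-2}(1 + k^{2a(r+s)}) \\
&\asymp \int_1^\infty (1 + \lambda x^{2(r+s)})^{-2}x^{2a(r+s)}\,dx \\
&\asymp \int_1^\infty (1 + \lambda x^{2(r+s)/(2a(r+s)+1)})^{-2}\,dx \\
&= \lambda^{-(a + 1/(2(r+s)))}\int_{\lambda^{a + 1/(2(r+s))}}^\infty (1 + x^{2(r+s)/(2a(r+s)+1)})^{-2}\,dx \\
&\asymp \lambda^{-(a + 1/(2(r+s)))}.
\end{aligned} $$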
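To sum up,

$$ \|\hat\beta - \tilde\beta\|_a^2 = O_p\big(n^{-1}\lambda^{-(a + 1/(2(r+s)))}\|\hat\beta - \bar\beta\|_c^2\big). \tag{89} $$

In particular, taking $a = c$ yields

$$ \|\hat\beta - \tilde\beta\|_c^2 = O_p\big(n^{-1}\lambda^{-(c + 1/(2(r+s)))}\|\hat\beta - \bar\beta\|_c^2\big). \tag{90} $$

If

$$ n^{-1}\lambda^{-(c + 1/(2(r+s)))} \to 0, \tag{91} $$

then

$$ \|\hat\beta - \tilde\beta\|_c = o_p(\|\hat\beta - \bar\beta\|_c). \tag{92} $$

Together with the triangular inequality

$$ \|\tilde\beta - \bar\beta\|_c \ge \|\hat\beta - \bar\beta\|_c - \|\hat\beta - \tilde\beta\|_c = (1 - o_p(1))\|\hat\beta - \bar\beta\|_c. \tag{93} $$

Therefore,

$$ \|\hat\beta - \bar\beta\|_c = O_p(\|\tilde\beta - \bar\beta\|_c). \tag{94} $$

Together with Lemma 13, we have

$$ \|\hat\beta - \bar\beta\|_c^2 = O_p\big(n^{-1}\lambda^{-(c + 1/(2(r+s)))}\big) = o_p(1). \tag{95} $$

Putting it back to (89), we now have: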

Lemma 14. If there also exists some $1/(2(r+s)) < c \le 1$ such that
$n^{-1}\lambda^{-(c + 1/(2(r+s)))} \to 0$, then

$$ \|\hat\beta - \tilde\beta\|_a^2 = o_p\big(n^{-1}\lambda^{-(a + 1/(2(r+s)))}\big). \tag{96} $$

Combining Lemmas 12-14, we have

$$ \lim_{D\to\infty}\lim_{n\to\infty}\sup_{F\in\mathcal{F}(s,M,K),\,\beta_0\in\mathcal{H}} P\big(\|\hat\beta_{n\lambda} - \beta_0\|_a^2 > Dn^{-2(1-a)(r+s)/(2(r+s)+1)}\big) = 0 \tag{97} $$

by taking $\lambda \asymp n^{-2(r+s)/(2(r+s)+1)}$.
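This choice of $\lambda$ can be read as balancing the deterministic error bound of Lemma 12 against the stochastic error bound of Lemma 13. A sketch of the arithmetic, ignoring constants and the factor $J(\beta_0)$:

```latex
\lambda^{1-a} \asymp n^{-1}\lambda^{-(a + 1/(2(r+s)))}
\iff \lambda^{1 + 1/(2(r+s))} \asymp n^{-1}
\iff \lambda \asymp n^{-2(r+s)/(2(r+s)+1)},
\qquad\text{so that}\qquad
\lambda^{1-a} \asymp n^{-2(1-a)(r+s)/(2(r+s)+1)} .
```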

6.6. Proof of Theorem 7. We now set out to show that $n^{-2(1-a)(r+s)/(2(r+s)+1)}$
is the optimal rate. It follows from a similar argument as that of Hall and
Horowitz (2007). Consider a setting where $\psi_k = \phi_k$, $k = 1, 2, \ldots$. Clearly in
this case we also have $\omega_k = \mu_k^{-1/2}\phi_k$. It suffices to show that the rate is
optimal in this special case. Recall that $\beta_0 = \sum a_k\phi_k$. Set

$$ a_k = \begin{cases} L_n^{-1/2}k^{-r}\theta_k, & L_n + 1 \le k \le 2L_n, \\ 0, & \text{otherwise}, \end{cases} \tag{98} $$

where $L_n$ is the integer part of $n^{1/(2(r+s)+1)}$, and $\theta_k$ is either $0$ or $1$. It is
clear that

$$ \|\beta_0\|_K^2 \le \sum_{k=L_n+1}^{2L_n} L_n^{-1} = 1. \tag{99} $$

Therefore $\beta_0 \in \mathcal{H}$. Now let $X$ admit the following expansion:
$X = \sum_k \xi_k k^{-s}\phi_k$, where the $\xi_k$ are independent random variables drawn from a
uniform distribution on $[-\sqrt{3}, \sqrt{3}]$. Simple algebraic manipulation shows that the
distribution of $X$ belongs to $\mathcal{F}(s, 3)$. The observed data are

$$ y_i = \sum_{k=L_n+1}^{2L_n} L_n^{-1/2}k^{-(r+s)}\xi_{ik}\theta_k + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{100} $$

where the noise $\varepsilon_i$ is assumed to be independently sampled from $N(0, M_2)$.
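As shown in Hall and Horowitz (2007),

$$ \liminf_{n\to\infty}\ \inf_{L_n < j \le 2L_n}\ \inf_{\tilde\theta_j}\ {\sup}^{*}\, E(\tilde\theta_j - \theta_j)^2 > 0, \tag{101} $$

where $\sup^*$ denotes the supremum over all $2^{L_n}$ choices of $(\theta_{L_n+1}, \ldots, \theta_{2L_n})$,
and $\inf_{\tilde\theta_j}$ is taken over all measurable functions $\tilde\theta_j$ of the data. Therefore, for
any estimate $\tilde\beta$,

$$ {\sup}^{*}\, E\|\tilde\beta - \beta_0\|_a^2 = {\sup}^{*} \sum_{k=L_n+1}^{2L_n} L_n^{-1}k^{-2(1-a)(r+s)}E(\tilde\theta_k - \theta_k)^2 \ge Mn^{-2(1-a)(r+s)/(2(r+s)+1)} \tag{102} $$

for some constant $M > 0$.
Denote

$$ \check\theta_k = \begin{cases} 1, & \tilde\theta_k > 1, \\ \tilde\theta_k, & 0 \le \tilde\theta_k \le 1, \\ 0, & \tilde\theta_k < 0. \end{cases} \tag{103} $$

It is easy to see that

$$ \sum_{k=L_n+1}^{2L_n} L_n^{-1}k^{-2(1-a)(r+s)}(\tilde\theta_k - \theta_k)^2 \ge \sum_{k=L_n+1}^{2L_n} L_n^{-1}k^{-2(1-a)(r+s)}(\check\theta_k - \theta_k)^2. \tag{104} $$

Hence, we can assume that $0 \le \tilde\theta_j \le 1$ without loss of generality in
establishing the lower bound. Subsequently,

$$ \sum_{k=L_n+1}^{2L_n} L_n^{-1}k^{-2(1-a)(r+s)}(\tilde\theta_k - \theta_k)^2 \le \sum_{k=L_n+1}^{2L_n} L_n^{-1}k^{-2(1-a)(r+s)} \le L_n^{-2(1-a)(r+s)}. \tag{105} $$

Together with (102), this implies that

$$ \lim_{n\to\infty}\ \inf_{\tilde\beta}\ {\sup}^{*}\, P\big(\|\tilde\beta - \beta_0\|_a^2 > dn^{-2(1-a)(r+s)/(2(r+s)+1)}\big) > 0 \tag{106} $$

for some constant $d > 0$.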



APPENDIX: SACKS-YLVISAKER CONDITIONS 

In Section 3, we discussed the relationship between the smoothness of
$C$ and the decay of its eigenvalues. More precisely, the smoothness can
be quantified by the so-called Sacks-Ylvisaker conditions. Following Ritter,
Wasilkowski and Wozniakowski (1995), denote

$$ \Omega_+ = \{(s,t) \in (0,1)^2 : s > t\} \qquad\text{and}\qquad \Omega_- = \{(s,t) \in (0,1)^2 : s < t\}. \tag{107} $$

Let $\mathrm{cl}(A)$ be the closure of a set $A$. Suppose that $L$ is a continuous
function on $\Omega_+ \cup \Omega_-$ such that $L|_{\Omega_j}$ is continuously extendable to $\mathrm{cl}(\Omega_j)$ for
$j \in \{+, -\}$. By $L_j$ we denote the extension of $L$ to $[0,1]^2$, which is
continuous on $\mathrm{cl}(\Omega_j)$, and on $[0,1]^2 \setminus \mathrm{cl}(\Omega_j)$. Furthermore write
$M^{(k,l)}(s,t) = (\partial^{k+l}/(\partial s^k\,\partial t^l))M(s,t)$. We say that a covariance function $M$ on $[0,1]^2$
satisfies the Sacks-Ylvisaker conditions of order $r$ if the following three
conditions hold:

(A) $L = M^{(r,r)}$ is continuous on $[0,1]^2$, and its partial derivatives up to order
2 are continuous on $\Omega_+ \cup \Omega_-$, and they are continuously extendable to
$\mathrm{cl}(\Omega_+)$ and $\mathrm{cl}(\Omega_-)$.

(B)

$$ \min_{0\le s\le 1}\big(L_-^{(1,0)}(s,s) - L_+^{(1,0)}(s,s)\big) > 0. \tag{108} $$

(C) $L_+^{(2,0)}(s,\cdot)$ belongs to the reproducing kernel Hilbert space spanned by
$L$ and furthermore

$$ \sup_{0\le s\le 1}\|L_+^{(2,0)}(s,\cdot)\|_L < \infty. \tag{109} $$
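A standard example is the Brownian motion covariance $M(s,t) = \min(s,t)$, which is known to satisfy these conditions with $r = 0$: condition (B) holds because $L_-^{(1,0)} - L_+^{(1,0)} = 1$. Its eigenvalues are $1/((k - 1/2)^2\pi^2)$, in line with the $k^{-2(s+1)}$ decay quoted in the proof of Theorem 5 for order $s = 0$. The sketch below checks this numerically with a simple grid discretization. The grid size, quadrature rule and function name are choices made here for illustration.

```python
import numpy as np

def brownian_covariance_eigenvalues(grid_size=400, n_eigs=10):
    """Leading eigenvalues of min(s, t) on [0, 1] via a uniform-grid discretization."""
    t = np.arange(1, grid_size + 1) / grid_size
    cov = np.minimum.outer(t, t)
    eigs = np.linalg.eigvalsh(cov / grid_size)    # quadrature weight 1 / grid_size
    return eigs[::-1][:n_eigs]                    # largest eigenvalues first

k = np.arange(1, 11)
mu_hat = brownian_covariance_eigenvalues()
mu_exact = 1.0 / ((k - 0.5) ** 2 * np.pi ** 2)    # Karhunen-Loeve eigenvalues
for kk, approx, exact in zip(k, mu_hat, mu_exact):
    print(f"k={kk:2d}: discretized {approx:.6f}, exact {exact:.6f}")
```

The discretized values track the exact ones closely for small $k$, and both decay like $k^{-2}$.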